PySpark RDD Cheat Sheet (Edureka)

PySpark RDDs are a distributed memory abstraction for performing in-memory computations on large clusters in a fault-tolerant manner. RDDs are created after starting PySpark and initializing a SparkContext. Transformations such as map, filter, and flatMap manipulate RDDs, while actions such as reduce, count, and take return values once a computation runs. Common operations include sorting, grouping, joining, and set operations on RDDs.
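The summary mentions joining alongside sorting, grouping, and set operations, but the sheet below has no join snippet, so here is a minimal sketch of a pair-RDD join. It assumes an existing SparkContext sc (created as in the Initialization section below) and uses made-up sample data.

>>> ages = sc.parallelize([('Jim', 24), ('Hope', 25), ('Sue', 26)])    # (name, age) pairs
>>> cities = sc.parallelize([('Jim', 'Boston'), ('Sue', 'Austin')])    # hypothetical (name, city) pairs
>>> sorted(ages.join(cities).collect())                                # inner join on the key
[('Jim', (24, 'Boston')), ('Sue', (26, 'Austin'))]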


PYSPARK RDD CHEAT SHEET
Learn PySpark at www.edureka.co

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets a programmer perform in-memory computations on large clusters in a fault-tolerant manner.

Initialization

Let's see how to start PySpark and enter the shell.
• Go to the folder where PySpark is installed
• Run the following commands

$ ./sbin/start-all.sh
$ ./bin/pyspark

Now that Spark is up and running, we need to initialize the Spark context, which is the heart of any Spark application.

>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Configurations

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
...         .setMaster("local[2]")
...         .setAppName("Edureka CheatSheet")
...         .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Spark Context Inspection

Once the Spark context is initialized, it is time to check that all the versions are correct and to inspect the default parameters being used by the SparkContext.

>>> sc.version                  # SparkContext version
>>> sc.pythonVer                # Python version
>>> sc.appName                  # Application name
>>> sc.applicationId            # Application ID
>>> sc.master                   # Master URL
>>> str(sc.sparkHome)           # Installed Spark path
>>> str(sc.sparkUser())         # Retrieve current SparkContext user
>>> sc.defaultParallelism       # Get default level of parallelism
>>> sc.defaultMinPartitions     # Get minimum number of partitions

Data Loading

Creating RDDs

Using parallelized collections
>>> rdd = sc.parallelize([('Jim', 24), ('Hope', 25), ('Sue', 26)])
>>> rdd = sc.parallelize([('a', 9), ('b', 7), ('c', 10)])
>>> num_rdd = sc.parallelize(range(1, 5000))

From other RDDs
>>> new_rdd = rdd.groupByKey()
>>> new_rdd = rdd.map(lambda x: (x, 1))

From a text file
>>> tfile_rdd = sc.textFile("/path/of_file/*.txt")

Reading a directory of text files
>>> tfile_rdd = sc.wholeTextFiles("/path/of_directory/")

RDD Statistics

Maximum value of RDD elements
>>> rdd.max()
Minimum value of RDD elements
>>> rdd.min()
Mean value of RDD elements
>>> rdd.mean()
Standard deviation of RDD elements
>>> rdd.stdev()
Get the summary statistics (count, mean, stdev, max & min)
>>> rdd.stats()
Number of partitions
>>> rdd.getNumPartitions()

Transformations

map
>>> rdd = sc.parallelize(["b", "a", "c"])
>>> sorted(rdd.map(lambda x: (x, 1)).collect())
[('a', 1), ('b', 1), ('c', 1)]

flatMap
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]

mapPartitions
>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> def f(iterator): yield sum(iterator)
>>> rdd.mapPartitions(f).collect()
[3, 7]

filter
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]

distinct
>>> sorted(sc.parallelize([1, 1, 2, 3]).distinct().collect())
[1, 2, 3]

Actions

reduce
>>> from operator import add
>>> sc.parallelize([1, 2, 3, 4, 5]).reduce(add)
15
>>> sc.parallelize((2 for _ in range(10))).map(lambda x: 1).cache().reduce(add)
10

count
>>> sc.parallelize([2, 3, 4]).count()
3

first
>>> sc.parallelize([2, 3, 4]).first()
2

take
>>> sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)
[2, 3]

countByValue
>>> sorted(sc.parallelize([1, 2, 1, 2, 2], 2).countByValue().items())
[(1, 2), (2, 3)]

Sorting and Grouping

sortBy
>>> tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
>>> sc.parallelize(tmp).sortBy(lambda x: x[0]).collect()
[('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]

sortByKey
>>> tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
>>> sc.parallelize(tmp).sortByKey(True, 1).collect()
[('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]

groupBy
>>> rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
>>> result = rdd.groupBy(lambda x: x % 2).collect()
>>> sorted([(x, sorted(y)) for (x, y) in result])
[(0, [2, 8]), (1, [1, 1, 3, 5])]

groupByKey
>>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> [(k, sorted(v)) for k, v in sorted(x.groupByKey().collect())]
[('a', [1, 1]), ('b', [1])]

fold
>>> from operator import add
>>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
15

Set Operations

__add__
>>> rdd = sc.parallelize([1, 1, 2, 3])
>>> (rdd + rdd).collect()
[1, 1, 2, 3, 1, 1, 2, 3]

subtract
>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]

union
>>> rdd = sc.parallelize([1, 1, 2, 3])
>>> rdd.union(rdd).collect()
[1, 1, 2, 3, 1, 1, 2, 3]

intersection
>>> rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
>>> rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
>>> rdd1.intersection(rdd2).collect()
[1, 2, 3]

cartesian
>>> rdd = sc.parallelize([1, 2])
>>> sorted(rdd.cartesian(rdd).collect())
[(1, 1), (1, 2), (2, 1), (2, 2)]

Saving

saveAsTextFile
>>> rdd.saveAsTextFile("rdd.txt")

saveAsHadoopFile
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent_folder/child_folder",
...                      'org.apache.hadoop.mapred.TextOutputFormat')

saveAsPickleFile
>>> from tempfile import NamedTemporaryFile
>>> tmpFile = NamedTemporaryFile(delete=True)
>>> tmpFile.close()
>>> sc.parallelize([1, 2, 'spark', 'rdd']).saveAsPickleFile(tmpFile.name, 3)
>>> sorted(sc.pickleFile(tmpFile.name, 5).map(str).collect())
['1', '2', 'rdd', 'spark']

Stopping SparkContext and Spark Daemons

Stopping SparkContext
>>> sc.stop()

Stopping Spark Daemons
$ ./sbin/stop-all.sh
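Putting the pieces of the sheet together, a minimal end-to-end sketch of a PySpark RDD session might look like the following, assuming a local Spark installation; the app name and numbers are illustrative only.

>>> from pyspark import SparkConf, SparkContext
>>> conf = SparkConf().setMaster("local[2]").setAppName("Edureka CheatSheet")
>>> sc = SparkContext(conf = conf)              # initialize the Spark context
>>> nums = sc.parallelize(range(1, 101))        # create an RDD from a local collection
>>> evens = nums.filter(lambda x: x % 2 == 0)   # transformation: keep even numbers
>>> evens.count()                               # action: number of even elements
50
>>> evens.reduce(lambda a, b: a + b)            # action: sum of 2 + 4 + ... + 100
2550
>>> sc.stop()                                   # stop the SparkContext when done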
