
Introduction to Big Data with PySpark

Spark RDDs with PySpark

Spark Overview
Spark is a framework designed to process large amounts of data. Originally built for creating data pipelines for machine learning workloads, Spark is capable of querying, transforming, and analyzing big data on a variety of data systems.

Spark Process Overview

Spark is able to process data quickly because it leverages the Random Access Memory (RAM) of a computing cluster. When processing data, Spark keeps it in RAM, which is much faster to read from and write to than a node's disk, and it does this in parallel across all worker nodes in the cluster. This differs from MapReduce, which reads and writes intermediate data to disk, and it explains why Spark is generally a faster framework than MapReduce.
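
A minimal sketch (not part of the original cheatsheet) of keeping an RDD in memory explicitly: .cache() marks an RDD to be stored in RAM (MEMORY_ONLY is the default storage level for RDDs), so repeated actions reuse the in-memory copy instead of recomputing it. The data and variable names below are made up for illustration.

# start a new SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# create an RDD and mark it to be kept in memory
rdd = spark.sparkContext.parallelize(range(1000))
rdd.cache()

# the first action computes and caches the RDD
print(rdd.count())
# output: 1000

# later actions reuse the cached, in-memory copy
print(rdd.sum())
# output: 499500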

PySpark Overview
The Spark framework is written in Scala but can be used in several languages, namely Python, Java, SQL, and R. PySpark is the Python API for Spark and can be installed directly from the leading Python repositories (PyPI and conda). PySpark is a particularly popular framework because it makes the big data processing of Spark available to Python programmers. Python is a more approachable and familiar language for many data practitioners than Scala.
Properties of RDDs
The three key properties of RDDs:
Fault-tolerant (resilient): data is recoverable in the event of failure
Partitioned (distributed): datasets are cut up and distributed to nodes (see the sketch below)
Operated on in parallel (parallelization): tasks are executed on all the chunks of data at the same time
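
As a hedged sketch (not from the original cheatsheet), the code below inspects the partitioned property of an RDD: .getNumPartitions() reports how many partitions the data was cut into, and .glom().collect() groups the elements by partition so the distribution is visible. The partition count of 3 and the data are chosen purely for illustration.

# start a new SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# create an RDD explicitly split into 3 partitions
rdd = spark.sparkContext.parallelize([1,2,3,4,5,6], 3)

# number of partitions the data was cut into
print(rdd.getNumPartitions())
# output: 3

# group the elements by partition to see the distribution
print(rdd.glom().collect())
# output (a local list is sliced evenly):
# [[1, 2], [3, 4], [5, 6]]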

Transforming an RDD
A transformation is a Spark operation that takes an existing RDD as an input and provides a new RDD that has been modified by the transformation as an output.

# start a new SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# create an RDD
tiny_rdd = spark.sparkContext.parallelize([1,2,3,4,5])

# transform tiny_rdd
transformed_tiny_rdd = tiny_rdd.map(lambda x: x+1) # apply x+1 to all RDD elements

# view the transformed RDD
transformed_tiny_rdd.collect()
# output:
# [2, 3, 4, 5, 6]
Lambdas in Spark Operations
Lambda expressions allow us to apply a simple operation to an object without needing to define it as a named function. This improves readability by condensing what could be a few lines of code into a single line. Utilizing lambdas in Spark operations allows us to apply any arbitrary function to all RDD elements specified by the transformation or action.

# start a new SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# create an RDD
rdd = spark.sparkContext.parallelize([1,2,3,4,5])

# transform rdd
transformed_rdd = rdd.map(lambda x: x*2) # multiply each RDD element by 2

# view the transformed RDD
transformed_rdd.collect()
# output:
# [2, 4, 6, 8, 10]

Executing Actions on RDDs
An action is a Spark operation that takes an RDD as input but always outputs a value instead of a new RDD.

# start a new SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# create an RDD
rdd = spark.sparkContext.parallelize([1,2,3,4,5])

# execute action
print(rdd.count())
# output:
# 5
Spark Transformations are Lazy
Transformations in Spark are not performed until an action is called. Spark optimizes and reduces overhead once it has the full list of transformations to perform. This behavior is called lazy evaluation. Pandas transformations, in contrast, are evaluated eagerly.
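
As a hedged illustration (not from the original cheatsheet), the snippet below shows laziness in practice: defining the .map() transformation returns immediately without touching the data, and the lambda only runs once the .collect() action is called.

# start a new SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# create an RDD
rdd = spark.sparkContext.parallelize([1,2,3,4,5])

# define a transformation; no computation happens yet
doubled = rdd.map(lambda x: x*2)
print(doubled)
# output: a description of the planned computation, e.g.
# PythonRDD[1] at RDD at PythonRDD.scala:53

# calling an action triggers the actual work
print(doubled.collect())
# output: [2, 4, 6, 8, 10]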

Viewing RDDs
Two common functions used to view RDDs are:
1. .collect(), which pulls the entire RDD into memory. This method will probably max out our memory if the RDD is big.
2. .take(n), which will only pull the first n elements of the RDD into memory.

# start a new SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# create an RDD
rdd = spark.sparkContext.parallelize([1,2,3,4,5])

# we can run collect() on a small RDD
rdd.collect()
# output: [1, 2, 3, 4, 5]

rdd.take(2)
# output: [1, 2]
Reducing RDDs
When executing .reduce() on an RDD, the reducing function must be both commutative and associative because RDDs are partitioned and sent to different nodes. Enforcing these two properties guarantees that parallelized tasks can be executed and completed in any order without affecting the output. Examples of operations with these properties include addition and multiplication (a counterexample is sketched after the code below).

# start a new SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# create an RDD
rdd = spark.sparkContext.parallelize([1,2,3,4,5])

# add all elements together
print(rdd.reduce(lambda x,y: x+y))
# output: 15

# multiply all elements together
print(rdd.reduce(lambda x,y: x*y))
# output: 120
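
To make the requirement concrete, here is a hedged counterexample (not part of the original cheatsheet): subtraction is neither commutative nor associative, so reducing with it can return different values depending on how the RDD happens to be partitioned. The partition counts below are chosen purely for illustration.

# start a new SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# the same data forced into different partition counts
rdd_2 = spark.sparkContext.parallelize([1,2,3,4,5], 2)
rdd_5 = spark.sparkContext.parallelize([1,2,3,4,5], 5)

# subtraction is neither commutative nor associative,
# so each partitioning can reduce to a different value
print(rdd_2.reduce(lambda x, y: x - y))
print(rdd_5.reduce(lambda x, y: x - y))
# the two results will typically differ from each other,
# unlike addition, which gives 15 for any partitioning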

Aggregating with Accumulators

Accumulator variables are shared variables that can only be updated through associative and commutative operations. They are primarily used as counters or sums in parallel computing since they operate on each node separately and adhere to both the associative and commutative properties. However, they are only reliable when used in actions, because Spark transformations can re-execute after a failure, which would incorrectly increment the accumulator.

# start a new SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# create an RDD
rdd = spark.sparkContext.parallelize([1,2,3,4,5])

# start the accumulator at zero
counter = spark.sparkContext.accumulator(0)

# add 1 to the accumulator for each element
rdd.foreach(lambda x: counter.add(1))

print(counter)
# output: 5
Sharing Broadcast Variables
In Spark, broadcast variables are cached input datasets that are sent to each node. This provides a performance boost when running operations that utilize the broadcasted dataset since all nodes have access to the same data. We would never want to broadcast large amounts of data because the size would be too much to serialize and send through the network.

# start a new SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# create an RDD
rdd = spark.sparkContext.parallelize(["Plane", "Plane", "Boat", "Car", "Car", "Boat", "Plane"])

# dictionary to broadcast
travel = {"Plane":"Air", "Boat":"Sea", "Car":"Ground"}

# create broadcast variable
broadcast_travel = spark.sparkContext.broadcast(travel)

# map the broadcast variable to the RDD
result = rdd.map(lambda x: broadcast_travel.value[x])

# view first four results
result.take(4)
# output: ['Air', 'Air', 'Sea', 'Ground']
