Introduction To Big Data With PySpark - Spark RDDs With PySpark Cheatsheet - Codecademy
Spark Overview
Spark is an analytics engine designed to process large amounts of data. Originally built for creating data pipelines for machine learning workloads, Spark is capable of querying, transforming, and analyzing big data on a variety of data systems.
PySpark Overview
The Spark framework is written in Scala but can be used in several languages, namely Python, Java, SQL, and R.
PySpark is the Python API for Spark and can be installed directly from the leading Python repositories (PyPI and conda). PySpark is a particularly popular framework because it makes the big data processing of Spark available to Python programmers. Python is a more approachable and familiar language for many data practitioners than Scala.
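As a quick sanity check after installing PySpark (for example, with pip install pyspark or via conda), a minimal sketch like the following imports the package and prints its version:

# verify the installation by importing the package
import pyspark

# print the installed PySpark version string
print(pyspark.__version__)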
Properties of RDDs
The three key properties of RDDs:
- Fault-tolerant (resilient): data is recoverable in the event of failure
- Partitioned (distributed): datasets are cut up and distributed to nodes
- Operated on in parallel (parallelization): tasks are executed on all the chunks of data at the same time
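A minimal sketch of the partitioning behavior: parallelize() distributes a local dataset into partitions, getNumPartitions() reports how many partitions were created, and glom() shows which elements landed in each partition.

# start a new SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# distribute a small dataset across 3 partitions
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6], 3)

# how many partitions the data was split into
print(rdd.getNumPartitions())
# output: 3

# glom() groups elements by partition so we can see the distribution
print(rdd.glom().collect())
# output: [[1, 2], [3, 4], [5, 6]]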
Transforming an RDD
A transformation is a Spark operation that takes an existing RDD as an input and provides a new RDD that has been modified by the transformation as an output.

# start a new SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# create an RDD
tiny_rdd = spark.sparkContext.parallelize([1,2,3,4,5])

# transform tiny_rdd
transformed_tiny_rdd = tiny_rdd.map(lambda x: x+1) # apply x+1 to all RDD elements
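Other common transformations follow the same pattern of returning a new RDD while leaving the input RDD unchanged. A brief sketch applying filter() and flatMap() to the same tiny_rdd:

# keep only the even elements (tiny_rdd itself is unchanged)
even_rdd = tiny_rdd.filter(lambda x: x % 2 == 0)

# emit each element along with ten times its value
expanded_rdd = tiny_rdd.flatMap(lambda x: [x, x * 10])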
Spark Transformations are Lazy
Transformations in Spark are not performed until an action is called. Spark optimizes and reduces overhead once it has the full list of transformations to perform. This behavior is called lazy evaluation. In contrast, Pandas transformations are evaluated eagerly.

# create an RDD
rdd = spark.sparkContext.parallelize([1,2,3,4,5])

# transform rdd (no computation happens yet)
transformed_rdd = rdd.map(lambda x: x*2) # multiply each RDD element by 2

# execute action
print(rdd.count())
# output: 5
Viewing RDDs
Two common functions used to view RDDs are:
1. .collect(), which pulls the entire RDD into memory. This method will probably max out our memory if the RDD is big.
2. .take(n), which pulls only the first n elements of the RDD into memory.

# start a new SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# create an RDD
rdd = spark.sparkContext.parallelize([1,2,3,4,5])

# view the first two elements
rdd.take(2)
# output: [1, 2]
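For comparison, a minimal sketch of .collect() on the same small RDD; pulling everything into memory is only safe here because the dataset is tiny:

# pull the entire (small) RDD into driver memory
rdd.collect()
# output: [1, 2, 3, 4, 5]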
Reducing RDDs
When executing .reduce() on an RDD, the reducing function must be both commutative and associative because RDDs are partitioned and sent to different nodes. Enforcing these two properties guarantees that parallelized tasks can be executed and completed in any order without affecting the output. Examples of operations with these properties include addition and multiplication.

# start a new SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# create an RDD
rdd = spark.sparkContext.parallelize([1,2,3,4,5])

# count the elements by mapping each one to 1 and reducing with addition
counter = rdd.map(lambda x: 1).reduce(lambda a, b: a + b)

print(counter)
# output: 5
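To illustrate why these properties matter, the sketch below reduces with subtraction, which is neither commutative nor associative, so the result can change with how the data happens to be partitioned:

# the same data split into different numbers of partitions
rdd_one = spark.sparkContext.parallelize([1, 2, 3, 4, 5], 1)
rdd_two = spark.sparkContext.parallelize([1, 2, 3, 4, 5], 3)

# subtraction breaks the commutative/associative requirement,
# so these two results will generally not match
print(rdd_one.reduce(lambda a, b: a - b))
print(rdd_two.reduce(lambda a, b: a - b))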
Sharing Broadcast Variables
In Spark, broadcast variables are cached input datasets that are sent to each node. This provides a performance boost when running operations that utilize the broadcasted dataset, since all nodes have access to the same data. We would never want to broadcast large amounts of data because the size would be too much to serialize and send through the network.

# start a new SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# create an RDD
rdd = spark.sparkContext.parallelize(["Plane", "Plane", "Boat", "Car", "Car", "Boat", "Plane"])

# dictionary to broadcast
travel = {"Plane":"Air", "Boat":"Sea", "Car":"Ground"}
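A minimal sketch of how the broadcast dictionary could then be used, assuming the standard sparkContext.broadcast() API; each node reads the lookup table through the broadcast variable's .value attribute:

# send a read-only copy of the dictionary to every node
broadcast_travel = spark.sparkContext.broadcast(travel)

# look up the travel medium for each element of the RDD
result = rdd.map(lambda x: broadcast_travel.value[x])

print(result.collect())
# output: ['Air', 'Air', 'Sea', 'Ground', 'Ground', 'Sea', 'Air']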