PySpark Tutorial For Beginners - Python Examples - Spark by (Examples)
In this PySpark Tutorial (Spark with Python) with examples, you will learn what PySpark is, its features, advantages, modules, and packages, and how to use RDD & DataFrame with sample examples in Python code.
All Spark examples provided in this PySpark (Spark with Python) tutorial are basic, simple, and easy to practice for beginners who are enthusiastic to learn PySpark and advance their careers in Big Data and Machine Learning.
Note: In case you can’t find the PySpark examples you are looking for on this tutorial page, I would recommend using the Search option in the menu bar to find your tutorial and example code. There are hundreds of tutorials on Spark, Scala, PySpark, and Python on this website that you can learn from.
If you are working with a smaller dataset and don’t have a Spark cluster, but still want benefits similar to Spark DataFrames, you can use Python pandas DataFrames. The main difference is that a pandas DataFrame is not distributed and runs on a single node.
What is PySpark
Introduction
Who uses PySpark
Features
Advantages
PySpark Architecture
Cluster Manager Types
Modules and Packages
PySpark Installation on Windows
Spyder IDE & Jupyter Notebook
PySpark RDD
RDD creation
RDD operations
PySpark DataFrame
Is PySpark faster than pandas?
DataFrame creation
DataFrame Operations
DataFrame external data sources
Supported file formats
PySpark SQL
PySpark Streaming
Streaming from TCP Socket
Streaming from Kafka
PySpark GraphFrames
GraphX vs GraphFrames
What is PySpark?
Before we jump into the PySpark tutorial, let’s first understand what PySpark is, how it is related to Python, who uses PySpark, and its advantages.
Introduction
PySpark is a Spark library written in Python to run Python applications using Apache Spark capabilities. Using PySpark, we can run applications in parallel on a distributed cluster (multiple nodes).
In other words, PySpark is a Python API for Apache Spark. Apache Spark is an analytical processing engine for large-scale, powerful distributed data processing and machine learning applications.
source: https://databricks.com/
Spark was basically written in Scala, and later, due to its industry adoption, its API PySpark was released for Python using Py4J. Py4J is a Java library that is integrated within PySpark and allows Python to dynamically interface with JVM objects; hence, to run PySpark you also need Java installed along with Python and Apache Spark.
Additionally, for development you can use the Anaconda distribution (widely used in the Machine Learning community), which comes with a lot of useful tools like the Spyder IDE and Jupyter Notebook to run PySpark applications.
In real-time use, PySpark is used a lot in the machine learning and data science community, thanks to the vast set of Python machine learning libraries. Spark runs operations on billions and trillions of rows of data on distributed clusters, 100 times faster than traditional Python applications.
PySpark is very well used in the Data Science and Machine Learning community as there are many widely used data science libraries written in Python, including NumPy and TensorFlow. It is also used for its efficient processing of large datasets. PySpark has been used by many organizations like Walmart, Trivago, Sanofi, Runtastic, and many more.
PySpark Features
In-memory computation
Distributed processing using parallelize
Can be used with many cluster managers (Spark, Yarn, Mesos, etc.)
Fault-tolerant
Immutable
Lazy evaluation
Cache & persistence
In-built optimization when using DataFrames
Supports ANSI SQL
Advantages of PySpark
PySpark Architecture
Apache Spark works in a master-slave architecture where the master is called “Driver”
and slaves are called “Workers”. When you run a Spark application, Spark Driver creates
a context that is an entry point to your application, and all operations (transformations
and actions) are executed on worker nodes, and the resources are managed by Cluster
Manager.
source: https://spark.apache.org/
As of writing this Spark with Python (PySpark) tutorial, Spark supports the following cluster managers:
Standalone – a simple cluster manager included with Spark that makes it easy to set up
a cluster.
Apache Mesos – Mesos is a cluster manager that can also run Hadoop MapReduce and PySpark applications.
Hadoop YARN – the resource manager in Hadoop 2. This is the most commonly used cluster manager.
Kubernetes – an open-source system for automating deployment, scaling, and
management of containerized applications.
local – not really a cluster manager, but still worth mentioning, as we pass “local” to master() in order to run Spark on your laptop/computer (see the sketch below).
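As an illustration, here is a minimal sketch of how the cluster manager is selected through the master URL when building a SparkSession (the standalone host name, Kubernetes address, and resource counts are placeholders):

from pyspark.sql import SparkSession

# Possible master URLs, one per cluster manager:
#   "local[*]"                 - run locally using all available cores
#   "spark://master-host:7077" - Spark standalone cluster (hypothetical host)
#   "yarn"                     - Hadoop YARN (uses the Hadoop configuration on the machine)
#   "k8s://https://host:6443"   - Kubernetes (hypothetical API server address)
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("ClusterManagerExample") \
    .getOrCreate()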
Besides these, if you want to use third-party libraries, you can find them at https://spark-packages.org/. This page is a repository of Spark third-party libraries.
PySpark Installation
In order to run the PySpark examples mentioned in this tutorial, you need to have Python, Spark, and their required tools installed on your computer. Since most developers use Windows for development, I will explain how to install PySpark on Windows.
Download and install either Python from Python.org or the Anaconda distribution, which includes Python, the Spyder IDE, and Jupyter Notebook. I would recommend Anaconda as it is popular and used by the Machine Learning & Data Science community. Follow the instructions to install the Anaconda Distribution and Jupyter Notebook.
Install Java 8
To run PySpark applications, you need Java 8 or a later version, hence download the Java version from Oracle and install it on your system.
Install Apache Spark
Download Apache Spark by accessing the Spark Download page and selecting the link from “Download Spark (point 3)”. If you want to use a different version of Spark & Hadoop, select the one you want from the drop-downs; the link in point 3 changes to the selected version and provides you with an updated download link.
After the download, untar the binary using 7zip and copy the underlying folder spark-3.0.0-bin-hadoop2.7 to c:\apps. Then set the following environment variables:
SPARK_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7
HADOOP_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7
PATH=%PATH%;C:\apps\spark-3.0.0-bin-hadoop2.7\bin
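If you prefer setting these from Python instead of the Windows environment variables dialog (for example, inside a notebook), here is a minimal sketch assuming the c:\apps layout above:

import os

# Point the Spark-related environment variables at the extracted Spark folder
os.environ["SPARK_HOME"] = r"C:\apps\spark-3.0.0-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = r"C:\apps\spark-3.0.0-bin-hadoop2.7"
os.environ["PATH"] = os.environ["PATH"] + r";C:\apps\spark-3.0.0-bin-hadoop2.7\bin"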
Setup winutils.exe
Download the winutils.exe file from winutils and copy it to the %SPARK_HOME%\bin folder. winutils is different for each Hadoop version, hence download the right version from https://github.com/steveloughran/winutils
PySpark shell
Now open the command prompt and type the pyspark command to run the PySpark shell.
$SPARK_HOME/bin/pyspark
The PySpark shell also creates a Spark context Web UI, which by default can be accessed at http://localhost:4040.
Spark Web UI
To keep a history of all completed applications, enable the Spark History Server by adding the following properties to spark-defaults.conf, then start the server:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path
$SPARK_HOME/sbin/start-history-server.sh
If you are running Spark on Windows, you can start the history server by running the command below.
$SPARK_HOME/bin/spark-class.cmd org.apache.spark.deploy.history.HistoryServer
By clicking on each App ID, you will get the details of the application in the Spark Web UI.
Spyder IDE & Jupyter Notebook
To write PySpark applications, you need an IDE. There are dozens of IDEs to work with; I choose to use the Spyder IDE and Jupyter Notebook. If you have not installed the Spyder IDE and Jupyter Notebook along with the Anaconda distribution, install these before you proceed.
PySpark RDD
In this section of the PySpark tutorial, I will introduce the RDD and explain how to create RDDs and use their transformation and action operations with examples. Here is the full article on PySpark RDD in case you want to learn more and get your fundamentals strong.
RDD Creation
In order to create an RDD, first you need to create a SparkSession, which is an entry point to the PySpark application. A SparkSession can be created using the SparkSession.builder pattern or the newSession() method of an existing SparkSession.
A Spark session internally creates a sparkContext variable of SparkContext. You can create multiple SparkSession objects, but there is only one SparkContext per JVM. In case you want to create another new SparkContext, you should stop the existing SparkContext (using stop()) before creating a new one.
# Import SparkSession
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()
using parallelize()
SparkContext has several functions to use with RDDs. For example, its parallelize() method is used to create an RDD from a list.
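For example, a small sketch that creates an RDD from a Python list (the values are just placeholders):

# Create an RDD from a Python list using parallelize()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])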
using textFile()
An RDD can also be created from a text file using the textFile() function of the SparkContext.
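A minimal sketch, assuming a text file exists at the hypothetical path shown:

# Create an RDD where each element is one line of the text file
rdd2 = spark.sparkContext.textFile("/tmp/test.txt")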
Once you have an RDD, you can perform transformation and action operations. Any
operation you perform on RDD runs in parallel.
RDD Operations
RDD transformations – lazy operations that return another RDD.
RDD actions – operations that trigger computation and return values to the driver.
RDD Transformations
Transformations on Spark RDDs return another RDD, and transformations are lazy, meaning they don’t execute until you call an action on the RDD. Some transformations on RDDs are flatMap(), map(), reduceByKey(), filter(), and sortByKey(); these return a new RDD instead of updating the current one.
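Here is a short sketch chaining a few of these transformations on the text-file RDD sketched earlier; nothing executes yet because no action has been called:

# Split each line into words, pair each word with 1, then sum the counts per word
words = rdd2.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda a, b: a + b)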
RDD Actions
An RDD action operation returns values from an RDD to the driver node. In other words, any RDD function that returns something other than RDD[T] is considered an action.
Some actions on RDDs are count(), collect(), first(), max(), reduce(), and more.
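Continuing the sketch above, calling an action triggers the computation and returns plain Python values to the driver:

# Actions return results to the driver program
print(wordCounts.count())    # number of distinct words
print(wordCounts.collect())  # list of (word, count) tuples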
PySpark DataFrame
The DataFrame definition is very well explained by Databricks, hence I do not want to define it again and confuse you. Below is the definition I took from Databricks:
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
– Databricks
If you are coming from a Python background, I would assume you already know what a pandas DataFrame is. A PySpark DataFrame is mostly similar to a pandas DataFrame, with the exception that PySpark DataFrames are distributed across the cluster (meaning the data in a DataFrame is stored on different machines in the cluster) and any operation in PySpark executes in parallel on all machines, whereas a pandas DataFrame stores and operates on a single machine.
If you have no Python background, I would recommend you learn some basics of Python before proceeding with this Spark tutorial. For now, just know that data in a PySpark DataFrame is stored on different machines in a cluster.
Due to parallel execution on all cores of multiple machines, PySpark runs operations faster than pandas. In other words, pandas DataFrames run operations on a single node whereas PySpark runs on multiple machines. To learn more, read pandas DataFrame vs PySpark Differences with Examples.
DataFrame creation
The simplest way to create a DataFrame is from a Python list of data. A DataFrame can also be created from an RDD and by reading files from several sources.
using createDataFrame()
data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
Since a DataFrame is a structured format that contains names and columns, we can get the schema of the DataFrame using df.printSchema() and display its data using df.show(), which produces the output below.
+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|dob |gender|salary|
+---------+----------+--------+----------+------+------+
|James | |Smith |1991-04-01|M |3000 |
|Michael |Rose | |2000-05-19|M |4000 |
|Robert | |Williams|1978-09-05|M |4000 |
|Maria |Anne |Jones |1967-12-01|F |4000 |
|Jen |Mary |Brown |1980-02-17|F |-1 |
+---------+----------+--------+----------+------+------+
DataFrame operations
Like RDD, DataFrame also has operations like Transformations and Actions.
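For example, a few transformations on the DataFrame created above, followed by the show() action that executes them:

# select() and filter() are transformations; show() is the action that runs them
df.select("firstname", "gender", "salary") \
  .filter(df.salary > 3000) \
  .show()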
In real-time applications, DataFrames are created from external sources like files from the local system, HDFS, S3, Azure, HBase, MySQL tables, etc. Below is an example of how to read a CSV file from the local system.
df = spark.read.csv("/tmp/resources/zipcodes.csv")
df.printSchema()
DataFrames have a rich set of APIs that support reading and writing several file formats; a short read/write sketch follows the list below.
csv
text
Avro
Parquet
tsv
xml and many more
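Here is the short read/write sketch mentioned above (the output paths are placeholders):

# Read a CSV file with a header row and let Spark infer the column types
df2 = spark.read.option("header", True) \
    .option("inferSchema", True) \
    .csv("/tmp/resources/zipcodes.csv")

# Write the same data back out in Parquet and JSON formats
df2.write.mode("overwrite").parquet("/tmp/output/zipcodes.parquet")
df2.write.mode("overwrite").json("/tmp/output/zipcodes.json")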
DataFrame Examples
In this section of the PySpark tutorial, you will find several Spark examples written in Python that will help you in your projects.
PySpark SQL
PySpark SQL is one of the most used PySpark modules and is used for processing structured, columnar data. Once you have a DataFrame created, you can interact with the data by using SQL syntax.
In other words, Spark SQL brings native raw SQL queries to Spark, meaning you can run traditional ANSI SQL on Spark DataFrames. In a later section of this PySpark SQL tutorial, you will learn in detail how to use SQL select, where, group by, join, union, etc.
Use the sql() method of the SparkSession object to run the query; this method returns a new DataFrame.
df.createOrReplaceTempView("PERSON_DATA")
df2 = spark.sql("SELECT * from PERSON_DATA")
df2.printSchema()
df2.show()
Similarly, you can run any traditional SQL queries on DataFrame’s using PySpark SQL.
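For example, a simple aggregation on the temporary view created above:

# Group the PERSON_DATA view by gender and count the rows in each group
groupDF = spark.sql("SELECT gender, count(*) AS total FROM PERSON_DATA GROUP BY gender")
groupDF.show()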
PySpark Streaming
source: https://spark.apache.org/
Streaming from TCP Socket
Use readStream.format("socket") on the SparkSession object to read data from a TCP socket, and provide the host and port options for the source you want to stream data from.
df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", "9090") \
    .load()
Spark reads the data from the socket and represents it in a “value” column of the DataFrame. df.printSchema() outputs:
root
|-- value: string (nullable = true)
After processing, you can stream the DataFrame to the console. In real-time applications, we ideally stream it to a destination like Kafka, a database, etc.
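The count DataFrame used in the snippet below is assumed to be the result of an aggregation on the socket stream; here is a hypothetical word-count sketch that produces it:

from pyspark.sql.functions import explode, split

# Split each incoming line into words and count the occurrences of each word
words = df.select(explode(split(df.value, " ")).alias("word"))
count = words.groupBy("word").count()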
query = count.writeStream \
    .format("console") \
    .outputMode("complete") \
    .start()
query.awaitTermination()
Streaming from Kafka
Using Spark Streaming, we can read from a Kafka topic and write to a Kafka topic in TEXT, CSV, AVRO, and JSON formats.
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "192.168.1.100:9092") \
    .option("subscribe", "json_topic") \
    .option("startingOffsets", "earliest") \
    .load()  # "earliest" reads the topic from the beginning
The PySpark example below writes messages to another Kafka topic using writeStream().
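A minimal sketch of such a writer, assuming the processed DataFrame has key and value columns; the output topic name and checkpoint path are placeholders:

# Cast key/value to strings and stream them to another Kafka topic
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
  .writeStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "192.168.1.100:9092") \
  .option("topic", "output_topic") \
  .option("checkpointLocation", "/tmp/checkpoint") \
  .start() \
  .awaitTermination()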