PySpark Tutorial For Beginners - Python Examples - Spark by (Examples)
In this PySpark Tutorial (Spark with Python) with examples, you will learn what PySpark is, its features, advantages, modules, and packages, and how to use RDD & DataFrame with sample examples in Python code.
All Spark examples provided in this PySpark (Spark with Python) tutorial are basic, simple, and easy to practice for beginners who are enthusiastic to learn PySpark and advance their careers in Big Data and Machine Learning.
Note: In case you can’t find the PySpark examples you are looking for on this tutorial page, I would recommend using the Search option in the menu bar to find your tutorial and example code. There are hundreds of tutorials on Spark, Scala, PySpark, and Python on this website that you can learn from.
If you are working with a smaller dataset and don’t have a Spark cluster, but still want benefits similar to Spark DataFrames, you can use Python pandas DataFrames. The main difference is that a pandas DataFrame is not distributed and runs on a single node.
What is PySpark
Introduction
Who uses PySpark
Features
Advantages
PySpark Architecture
Cluster Manager Types
Modules and Packages
PySpark Installation on Windows
Spyder IDE & Jupyter Notebook
PySpark RDD
RDD creation
RDD operations
PySpark DataFrame
Is PySpark faster than pandas?
DataFrame creation
DataFrame Operations
DataFrame external data sources
Supported file formats
PySpark SQL
PySpark Streaming
Streaming from TCP Socket
Streaming from Kafka
PySpark GraphFrames
GraphX vs GraphFrames
What is PySpark?
Before we jump into the PySpark tutorial, let’s first understand what PySpark is, how it is related to Python, who uses PySpark, and its advantages.
Introduction
PySpark is a Spark library written in Python to run Python applications using Apache Spark capabilities. Using PySpark, we can run applications in parallel on a distributed cluster (multiple nodes).
In other words, PySpark is a Python API for Apache Spark. Apache Spark is an analytical processing engine for large-scale, powerful distributed data processing and machine learning applications.
source: https://databricks.com/
Spark was basically written in Scala, and later, due to its industry adoption, its API PySpark was released for Python using Py4J. Py4J is a Java library that is integrated within PySpark and allows Python to dynamically interface with JVM objects; hence, to run PySpark you also need Java installed along with Python and Apache Spark.
Additionally, for development you can use the Anaconda distribution (widely used in the Machine Learning community), which comes with a lot of useful tools like the Spyder IDE and Jupyter Notebook to run PySpark applications.
In real-time use, PySpark is used a lot in the machine learning and data science community, thanks to the vast set of Python machine learning libraries. Spark runs operations on billions and trillions of rows of data on distributed clusters, 100 times faster than traditional Python applications.
PySpark is very well used in the Data Science and Machine Learning community as there are many widely used data science libraries written in Python, including NumPy and TensorFlow. It is also used for its efficient processing of large datasets. PySpark has been used by many organizations like Walmart, Trivago, Sanofi, Runtastic, and many more.
PySpark Features
In-memory computation
Distributed processing using parallelize
Can be used with many cluster managers (Spark, Yarn, Mesos, etc.)
Fault-tolerant
Immutable
Lazy evaluation
Cache & persistence
In-built optimization when using DataFrames
Supports ANSI SQL
Advantages of PySpark
PySpark Architecture
Apache Spark works in a master-slave architecture where the master is called “Driver”
and slaves are called “Workers”. When you run a Spark application, Spark Driver creates
a context that is an entry point to your application, and all operations (transformations
and actions) are executed on worker nodes, and the resources are managed by Cluster
Manager.
source: https://spark.apache.org/
As of writing this Spark with Python (PySpark) tutorial, Spark supports the following cluster managers:
Standalone – a simple cluster manager included with Spark that makes it easy to set up
a cluster.
Apache Mesos – Mesos is a cluster manager that can also run Hadoop MapReduce and PySpark applications.
Hadoop YARN – the resource manager in Hadoop 2. This is the most commonly used cluster manager.
Kubernetes – an open-source system for automating deployment, scaling, and
management of containerized applications.
local – not really a cluster manager, but still worth mentioning, as we pass “local” to master() in order to run Spark on your laptop/computer (see the sketch below).
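As an illustration, here is a minimal sketch of how the cluster manager is selected through the master URL when building a SparkSession (the standalone host name, Kubernetes address, and resource counts are placeholders):

from pyspark.sql import SparkSession

# Possible master URLs, one per cluster manager:
#   "local[*]"                 - run locally using all available cores
#   "spark://master-host:7077" - Spark standalone cluster (hypothetical host)
#   "yarn"                     - Hadoop YARN (uses the Hadoop configuration on the machine)
#   "k8s://https://host:6443"   - Kubernetes (hypothetical API server address)
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("ClusterManagerExample") \
    .getOrCreate()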
Besides these, if you want to use third-party libraries, you can find them at https://spark-packages.org/. This page is a repository of Spark third-party libraries.
PySpark Installation
In order to run the PySpark examples mentioned in this tutorial, you need to have Python, Spark, and their required tools installed on your computer. Since most developers use Windows for development, I will explain how to install PySpark on Windows.
Download and install either Python from Python.org or the Anaconda distribution, which includes Python, the Spyder IDE, and Jupyter Notebook. I would recommend Anaconda as it is popular and used by the Machine Learning & Data Science community. Follow the instructions to install the Anaconda Distribution and Jupyter Notebook.
Install Java 8
To run PySpark applications, you need Java 8 or a later version, hence download the Java version from Oracle and install it on your system.
Install Apache Spark
Download Apache Spark by accessing the Spark Download page and selecting the link from “Download Spark (point 3)”. If you want to use a different version of Spark & Hadoop, select the one you want from the drop-downs; the link in point 3 changes to the selected version and provides you with an updated download link.
After the download, untar the binary using 7zip and copy the underlying folder spark-3.0.0-bin-hadoop2.7 to c:\apps. Then set the following environment variables:
SPARK_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7
HADOOP_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7
PATH=%PATH%;C:\apps\spark-3.0.0-bin-hadoop2.7\bin
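If you prefer setting these from Python instead of the Windows environment variables dialog (for example, inside a notebook), here is a minimal sketch assuming the c:\apps layout above:

import os

# Point the Spark-related environment variables at the extracted Spark folder
os.environ["SPARK_HOME"] = r"C:\apps\spark-3.0.0-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = r"C:\apps\spark-3.0.0-bin-hadoop2.7"
os.environ["PATH"] = os.environ["PATH"] + r";C:\apps\spark-3.0.0-bin-hadoop2.7\bin"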
Setup winutils.exe
Download the winutils.exe file from winutils and copy it to the %SPARK_HOME%\bin folder. winutils is different for each Hadoop version, hence download the right version from https://github.com/steveloughran/winutils
PySpark shell
Now open the command prompt and type the pyspark command to run the PySpark shell.
$SPARK_HOME/bin/pyspark
The PySpark shell also creates a Spark context Web UI, which by default can be accessed at http://localhost:4040.
Spark Web UI
To keep a history of all completed applications, enable the Spark History Server by adding the following properties to spark-defaults.conf, then start the server:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path
$SPARK_HOME/sbin/start-history-server.sh
If you are running Spark on Windows, you can start the history server by running the command below.
$SPARK_HOME/bin/spark-class.cmd org.apache.spark.deploy.history.HistoryServer
By clicking on each App ID, you will get the details of the application in the Spark Web UI.
Spyder IDE & Jupyter Notebook
To write PySpark applications, you need an IDE. There are dozens of IDEs to work with; I choose to use the Spyder IDE and Jupyter Notebook. If you have not installed the Spyder IDE and Jupyter Notebook along with the Anaconda distribution, install these before you proceed.
PySpark RDD
In this section of the PySpark tutorial, I will introduce the RDD and explain how to create RDDs and use their transformation and action operations with examples. Here is the full article on PySpark RDD in case you want to learn more and get your fundamentals strong.
RDD Creation
In order to create an RDD, first you need to create a SparkSession, which is an entry point to the PySpark application. A SparkSession can be created using the SparkSession.builder pattern or the newSession() method of an existing SparkSession.
A Spark session internally creates a sparkContext variable of SparkContext. You can create multiple SparkSession objects, but there is only one SparkContext per JVM. In case you want to create another new SparkContext, you should stop the existing SparkContext (using stop()) before creating a new one.
# Import SparkSession
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()
using parallelize()
SparkContext has several functions to use with RDDs. For example, its parallelize() method is used to create an RDD from a list.
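For example, a small sketch that creates an RDD from a Python list (the values are just placeholders):

# Create an RDD from a Python list using parallelize()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])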
using textFile()
An RDD can also be created from a text file using the textFile() function of the SparkContext.
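A minimal sketch, assuming a text file exists at the hypothetical path shown:

# Create an RDD where each element is one line of the text file
rdd2 = spark.sparkContext.textFile("/tmp/test.txt")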
Once you have an RDD, you can perform transformation and action operations. Any
operation you perform on RDD runs in parallel.
RDD Operations
RDD transformations – lazy operations that return another RDD.
RDD actions – operations that trigger computation and return values to the driver.
RDD Transformations
Transformations on Spark RDDs return another RDD, and transformations are lazy, meaning they don’t execute until you call an action on the RDD. Some transformations on RDDs are flatMap(), map(), reduceByKey(), filter(), and sortByKey(); these return a new RDD instead of updating the current one.
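Here is a short sketch chaining a few of these transformations on the text-file RDD sketched earlier; nothing executes yet because no action has been called:

# Split each line into words, pair each word with 1, then sum the counts per word
words = rdd2.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda a, b: a + b)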
RDD Actions
An RDD action operation returns values from an RDD to the driver node. In other words, any RDD function that returns something other than RDD[T] is considered an action.
Some actions on RDDs are count(), collect(), first(), max(), reduce(), and more.
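Continuing the sketch above, calling an action triggers the computation and returns plain Python values to the driver:

# Actions return results to the driver program
print(wordCounts.count())    # number of distinct words
print(wordCounts.collect())  # list of (word, count) tuples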
PySpark DataFrame
The DataFrame definition is very well explained by Databricks, hence I do not want to define it again and confuse you. Below is the definition I took from Databricks:
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
– Databricks
If you are coming from a Python background, I would assume you already know what a pandas DataFrame is. A PySpark DataFrame is mostly similar to a pandas DataFrame, with the exception that PySpark DataFrames are distributed across the cluster (meaning the data in a DataFrame is stored on different machines in the cluster) and any operation in PySpark executes in parallel on all machines, whereas a pandas DataFrame stores and operates on a single machine.
If you have no Python background, I would recommend you learn some basics of Python before proceeding with this Spark tutorial. For now, just know that data in a PySpark DataFrame is stored on different machines in a cluster.
Due to parallel execution on all cores of multiple machines, PySpark runs operations faster than pandas. In other words, pandas DataFrames run operations on a single node whereas PySpark runs on multiple machines. To learn more, read pandas DataFrame vs PySpark Differences with Examples.
DataFrame creation
The simplest way to create a DataFrame is from a Python list of data. A DataFrame can also be created from an RDD and by reading files from several sources.
using createDataFrame()
data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
Since a DataFrame is a structured format that contains names and columns, we can get the schema of the DataFrame using df.printSchema() and display its data using df.show(), which produces the output below.
+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|dob |gender|salary|
+---------+----------+--------+----------+------+------+
|James | |Smith |1991-04-01|M |3000 |
|Michael |Rose | |2000-05-19|M |4000 |
|Robert | |Williams|1978-09-05|M |4000 |
|Maria |Anne |Jones |1967-12-01|F |4000 |
|Jen |Mary |Brown |1980-02-17|F |-1 |
+---------+----------+--------+----------+------+------+
DataFrame operations
Like RDD, DataFrame also has operations like Transformations and Actions.
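For example, a few transformations on the DataFrame created above, followed by the show() action that executes them:

# select() and filter() are transformations; show() is the action that runs them
df.select("firstname", "gender", "salary") \
  .filter(df.salary > 3000) \
  .show()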
In real-time applications, DataFrames are created from external sources like files from the local system, HDFS, S3, Azure, HBase, MySQL tables, etc. Below is an example of how to read a CSV file from the local system.
df = spark.read.csv("/tmp/resources/zipcodes.csv")
df.printSchema()
DataFrames have a rich set of APIs that support reading and writing several file formats; a short read/write sketch follows the list below.
csv
text
Avro
Parquet
tsv
xml and many more
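Here is the short read/write sketch mentioned above (the output paths are placeholders):

# Read a CSV file with a header row and let Spark infer the column types
df2 = spark.read.option("header", True) \
    .option("inferSchema", True) \
    .csv("/tmp/resources/zipcodes.csv")

# Write the same data back out in Parquet and JSON formats
df2.write.mode("overwrite").parquet("/tmp/output/zipcodes.parquet")
df2.write.mode("overwrite").json("/tmp/output/zipcodes.json")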
DataFrame Examples
In this section of the PySpark tutorial, you will find several Spark examples written in Python that will help you in your projects.
PySpark SQL
PySpark SQL is one of the most used PySpark modules and is used for processing structured, columnar data. Once you have a DataFrame created, you can interact with the data by using SQL syntax.
In other words, Spark SQL brings native raw SQL queries to Spark, meaning you can run traditional ANSI SQL on Spark DataFrames. In a later section of this PySpark SQL tutorial, you will learn in detail how to use SQL select, where, group by, join, union, etc.
Use the sql() method of the SparkSession object to run the query; this method returns a new DataFrame.
df.createOrReplaceTempView("PERSON_DATA")
df2 = spark.sql("SELECT * from PERSON_DATA")
df2.printSchema()
df2.show()
Similarly, you can run any traditional SQL queries on DataFrame’s using PySpark SQL.
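For example, a simple aggregation on the temporary view created above:

# Group the PERSON_DATA view by gender and count the rows in each group
groupDF = spark.sql("SELECT gender, count(*) AS total FROM PERSON_DATA GROUP BY gender")
groupDF.show()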
PySpark Streaming
source: https://spark.apache.org/
Streaming from TCP Socket
Use readStream.format("socket") on the SparkSession object to read data from a TCP socket, and provide the host and port options for the source you want to stream data from.
df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", "9090") \
    .load()
Spark reads the data from the socket and represents it in a “value” column of the DataFrame. df.printSchema() outputs:
root
|-- value: string (nullable = true)
After processing, you can stream the DataFrame to the console. In real-time applications, we ideally stream it to a destination like Kafka, a database, etc.
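The count DataFrame used in the snippet below is assumed to be the result of an aggregation on the socket stream; here is a hypothetical word-count sketch that produces it:

from pyspark.sql.functions import explode, split

# Split each incoming line into words and count the occurrences of each word
words = df.select(explode(split(df.value, " ")).alias("word"))
count = words.groupBy("word").count()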
query = count.writeStream \
    .format("console") \
    .outputMode("complete") \
    .start()
query.awaitTermination()
Streaming from Kafka
Using Spark Streaming, we can read from a Kafka topic and write to a Kafka topic in TEXT, CSV, AVRO, and JSON formats.
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "192.168.1.100:9092") \
    .option("subscribe", "json_topic") \
    .option("startingOffsets", "earliest") \
    .load()  # "earliest" reads the topic from the beginning
The PySpark example below writes messages to another Kafka topic using writeStream().
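A minimal sketch of such a writer, assuming the processed DataFrame has key and value columns; the output topic name and checkpoint path are placeholders:

# Cast key/value to strings and stream them to another Kafka topic
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
  .writeStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "192.168.1.100:9092") \
  .option("topic", "output_topic") \
  .option("checkpointLocation", "/tmp/checkpoint") \
  .start() \
  .awaitTermination()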