PySpark

The document provides a comprehensive overview of creating and manipulating PySpark DataFrames, including initializing SparkSession, creating DataFrames from various sources, and performing data operations like filtering, grouping, and applying functions. It also discusses the integration of PySpark with pandas, testing PySpark applications using built-in utilities and frameworks like unittest and pytest. Additionally, it covers data input/output formats and the use of Spark Connect for remote connectivity.

PySpark applications start with initializing a SparkSession, which is the entry point of PySpark, as below. When running in the PySpark shell via the pyspark executable, the shell automatically creates the session in the variable spark for users.
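A minimal sketch of creating the session (unnecessary in the shell, where spark already exists):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()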


dataframe creation
A PySpark DataFrame can be created via pyspark.sql.SparkSession.createDataFrame, typically by passing a list of lists, tuples, dictionaries and pyspark.sql.Rows, a pandas DataFrame, or an RDD consisting of such a list. pyspark.sql.SparkSession.createDataFrame takes the schema argument to specify the schema of the DataFrame. When it is omitted, PySpark infers the corresponding schema by taking a sample from the data.

Firstly, you can create a PySpark DataFrame from a list of rows. You can also create a PySpark DataFrame with an explicit schema, or create one from a pandas DataFrame. The DataFrames created in these ways all have the same results and schema.
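A sketch of the three approaches (the column names and sample values are illustrative, not taken from the original examples):

    from datetime import date
    import pandas as pd
    from pyspark.sql import Row

    # From a list of Row objects; the schema is inferred from the data.
    df = spark.createDataFrame([
        Row(a=1, b=2.0, c="string1", d=date(2000, 1, 1)),
        Row(a=2, b=3.0, c="string2", d=date(2000, 2, 1)),
    ])

    # With an explicit schema, given here as a DDL string.
    df = spark.createDataFrame(
        [(1, 2.0, "string1", date(2000, 1, 1)),
         (2, 3.0, "string2", date(2000, 2, 1))],
        schema="a long, b double, c string, d date")

    # From a pandas DataFrame.
    pandas_df = pd.DataFrame({"a": [1, 2], "b": [2.0, 3.0], "c": ["string1", "string2"]})
    df = spark.createDataFrame(pandas_df)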

viewing data
The top rows of a DataFrame can be displayed using DataFrame.show().

Alternatively, you can enable spark.sql.repl.eagerEval.enabled configuration for the eager evaluation of PySpark DataFrame in notebooks such as Jupyter. The number of rows to show can be controlled via spark.sql.repl.eagerEval.maxNumRows configuration.
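A short sketch, assuming the df created above:

    df.show(1)

    # Eager evaluation in a notebook (runtime SQL configuration).
    spark.conf.set("spark.sql.repl.eagerEval.enabled", True)
    df  # in Jupyter, the DataFrame is now rendered eagerly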

The rows can also be shown vertically. This is useful when rows are too long to show horizontally.

You can see the DataFrame's schema and column names as follows:

Show the summary of the DataFrame.
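A sketch of these calls, again assuming the df created above with columns a, b and c:

    df.show(1, vertical=True)    # show rows vertically

    df.columns                   # column names
    df.printSchema()             # schema

    df.select("a", "b", "c").describe().show()   # summary statistics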

DataFrame.collect() collects the distributed data to the driver side as the local data in Python. Note that this can throw an out-of-memory error when the dataset is too large to fit in the driver side, because it collects all the data from the executors to the driver. In order to avoid throwing an out-of-memory exception, use DataFrame.take() or DataFrame.tail().

selecting and accessing data
PySpark DataFrame is lazily evaluated, and simply selecting a column does not trigger the computation; it returns a Column instance. In fact, most column-wise operations return Columns. These Columns can be used to select columns from a DataFrame. For example, DataFrame.select() takes Column instances and returns another DataFrame. Use DataFrame.withColumn() to assign a new Column instance, and DataFrame.filter() to select a subset of rows.
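A sketch of selecting and accessing columns (the column names are illustrative):

    from pyspark.sql.functions import upper

    df.a                                           # a Column instance; nothing is computed yet
    df.select(df.c).show()                         # select a column into a new DataFrame
    df.withColumn("upper_c", upper(df.c)).show()   # assign a new Column instance
    df.filter(df.a == 1).show()                    # select a subset of rows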

PySpark DataFrame also provides the conversion back to a pandas DataFrame to leverage the pandas API. Note that toPandas also collects all data into the driver side, and can easily cause an out-of-memory error when the data is too large to fit into the driver side.
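A short sketch of materializing data on the driver (take and tail limit how much is pulled back):

    df.collect()     # all rows; may not fit in driver memory
    df.take(1)       # only the first row(s)
    df.tail(1)       # only the last row(s)
    df.toPandas()    # convert to a pandas DataFrame; also collects everything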
applying a function
PySpark supports various UDFs and APIs to allow users to execute Python native functions. See also the latest Pandas UDFs and Pandas Function APIs. For instance, the example below allows users to directly use the APIs in a pandas Series within a Python native function.

Another example is DataFrame.mapInPandas, which allows users to directly use the APIs in a pandas DataFrame without any restrictions such as the result length.
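A sketch of both approaches, assuming the df above with a numeric column a (the function names are illustrative):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # A Pandas UDF: the Python function works directly on a pandas Series.
    @pandas_udf("long")
    def pandas_plus_one(series: pd.Series) -> pd.Series:
        return series + 1

    df.select(pandas_plus_one(df.a)).show()

    # mapInPandas: the function receives and yields whole pandas DataFrames.
    def pandas_filter_func(iterator):
        for pdf in iterator:
            yield pdf[pdf.a == 1]

    df.mapInPandas(pandas_filter_func, schema=df.schema).show()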

grouping data
PySpark DataFrame also provides a way of handling grouped data by using the common split-apply-combine strategy. It groups the data by a certain condition, applies a function to each group, and then combines them back into the DataFrame.

Grouping and then applying the avg() function to the resulting groups. You can also apply a Python native function against each group by using the pandas API, and co-group two DataFrames and apply a function.
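A sketch of the three grouped operations (column names and values are illustrative):

    import pandas as pd

    df = spark.createDataFrame(
        [["red", "banana", 1, 10], ["blue", "banana", 2, 20],
         ["red", "carrot", 3, 30], ["blue", "grape", 4, 40]],
        schema=["color", "fruit", "v1", "v2"])

    # Grouping and then applying avg() to the resulting groups.
    df.groupby("color").avg().show()

    # Applying a Python native function to each group via the pandas API.
    def plus_mean(pdf):
        return pdf.assign(v1=pdf.v1 - pdf.v1.mean())

    df.groupby("color").applyInPandas(plus_mean, schema=df.schema).show()

    # Co-grouping two DataFrames and applying a function.
    df1 = spark.createDataFrame([(1, 1.0), (2, 2.0)], ("id", "v1"))
    df2 = spark.createDataFrame([(1, "x"), (2, "y")], ("id", "v2"))

    def merge(left, right):
        return pd.merge(left, right, on="id")

    df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
        merge, schema="id long, v1 double, v2 string").show()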

getting data in/out


CSV is straightforward and easy to use. Parquet and ORC are efficient and
compact file formats to read and write faster.

There are many other data sources available in PySpark such as JDBC, text,
binaryFile, Avro, etc. See also the latest Spark SQL, DataFrames and Datasets
Guide in Apache Spark documentation.

CSV

Parquet

ORC
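A minimal sketch of writing and reading each format (the file paths are illustrative):

    df.write.csv("foo.csv", header=True)
    spark.read.csv("foo.csv", header=True).show()

    df.write.parquet("bar.parquet")
    spark.read.parquet("bar.parquet").show()

    df.write.orc("zoo.orc")
    spark.read.orc("zoo.orc").show()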

working with sql
DataFrame and Spark SQL share the same execution engine, so they can be used interchangeably and seamlessly. For example, you can register the DataFrame as a table and run SQL on it easily, as below:
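A sketch of registering a temporary view and querying it (the table name is illustrative):

    df.createOrReplaceTempView("tableA")
    spark.sql("SELECT count(*) FROM tableA").show()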

In addition, UDFs can be registered and invoked in SQL out of the box:
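A sketch, assuming the df registered above has a numeric column v1:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("integer")
    def add_one(s: pd.Series) -> pd.Series:
        return s + 1

    spark.udf.register("add_one", add_one)
    spark.sql("SELECT add_one(v1) FROM tableA").show()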

These SQL expressions can directly be mixed and used as PySpark columns:
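A sketch of mixing SQL expressions with DataFrame columns, reusing the add_one UDF registered above:

    from pyspark.sql.functions import expr

    df.selectExpr("add_one(v1)").show()
    df.select(expr("count(*)") > 0).show()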

quickstart: spark connect
Spark Connect introduced a decoupled client-server architecture for Spark that allows remote connectivity to Spark clusters using the DataFrame API. Spark Connect includes both client and server components, and we will show you how to set up and use both.

launch spark server with spark connect
To launch Spark with support for Spark Connect sessions, run the start-connect-server.sh script.

connect to spark connect server
Now that the Spark server is running, we can connect to it remotely using Spark Connect. We do this by creating a remote Spark session on the client where our application runs. Before we can do that, we need to make sure to stop the existing regular Spark session, because it cannot coexist with the remote Spark Connect session we are about to create.

The command we used above to launch the server configured Spark to run as localhost:15002. So now we can create a remote Spark session on the client using the following command.
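A minimal sketch, assuming the server listens on the default localhost:15002:

    from pyspark.sql import SparkSession

    # Stop the regular (non-Connect) session first.
    SparkSession.builder.master("local[*]").getOrCreate().stop()

    # Create a remote Spark session against the Spark Connect server.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()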

quickstart: pandas api on spark
This is a short introduction to pandas API on Spark, geared mainly for new users. This notebook shows you some key differences between pandas and pandas API on Spark. You can run these examples yourself in 'Live Notebook: pandas API on Spark' at the quickstart page.

Customarily, we import pandas API on Spark as follows:
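A sketch of the usual imports:

    import pandas as pd
    import numpy as np
    import pyspark.pandas as ps
    from pyspark.sql import SparkSession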

object creation
Creating a pandas-on-Spark Series by passing a list of values, letting pandas API on Spark create a default integer index:

Creating a pandas-on-Spark DataFrame by passing a dict of objects that can be converted to series-like:

Creating a pandas DataFrame by passing a numpy array, with a datetime index and labeled columns:

Now, this pandas DataFrame can be converted to a pandas-on-Spark DataFrame. Also, it is possible to create a pandas-on-Spark DataFrame from a Spark DataFrame easily: create a Spark DataFrame from the pandas DataFrame, then create a pandas-on-Spark DataFrame from the Spark DataFrame. Having specific dtypes is also possible; types that are common to both Spark and pandas are currently supported.

Here is how to show the top rows of the frame. Note that the data in a Spark DataFrame does not preserve the natural order by default. The natural order can be preserved by setting the compute.ordered_head option, but it causes a performance overhead with sorting internally.

Displaying the index, columns, and the underlying numpy data.

Showing a quick statistic summary of your data.
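A sketch covering the creation and inspection steps above (values are illustrative; spark is the session created earlier):

    # pandas-on-Spark Series from a list of values.
    s = ps.Series([1, 3, 5, np.nan, 6, 8])

    # pandas-on-Spark DataFrame from a dict of series-like objects.
    psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

    # pandas DataFrame from a numpy array, with a datetime index and labeled columns.
    dates = pd.date_range("20130101", periods=6)
    pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

    # pandas -> pandas-on-Spark, pandas -> Spark, Spark -> pandas-on-Spark.
    psdf = ps.from_pandas(pdf)
    sdf = spark.createDataFrame(pdf)
    psdf = sdf.pandas_api()

    psdf.dtypes        # specific dtypes common to Spark and pandas
    psdf.head()        # top rows (natural order not preserved by default)
    psdf.index, psdf.columns, psdf.to_numpy()
    psdf.describe()    # quick statistic summary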


operations
Stats
Performing a descriptive statistic:

Transposing your data

Sorting by its index

Sorting by value
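A sketch, using the psdf created above (columns A through D):

    psdf.mean()                        # descriptive statistic
    psdf.T                             # transpose
    psdf.sort_index(ascending=False)   # sort by index
    psdf.sort_values(by="B")           # sort by value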


Grouping

By "group by" we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure

Grouping and then applying the sum() function to the resulting groups.

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.
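A sketch of grouping (column names and values are illustrative):

    psdf = ps.DataFrame({
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": np.random.randn(4),
        "D": np.random.randn(4),
    })

    psdf.groupby("A").sum()          # group by one column, then sum
    psdf.groupby(["A", "B"]).sum()   # multiple columns form a hierarchical index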

missing data
Pandas API on Spark primarily uses the value np.nan to represent missing data. It is by default not included in computations.

To drop any rows that have missing data.

Filling missing data.
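A sketch of both operations on a small frame with missing values (the frame itself is illustrative):

    psdf_miss = ps.DataFrame({"A": [1.0, None, 3.0], "B": [None, 5.0, 6.0]})

    psdf_miss.dropna(how="any")   # drop any rows that have missing data
    psdf_miss.fillna(value=5)     # fill missing data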

Now, let’s define and apply a transformation function to our DataFrame.

testing pyspark
This guide is a reference for writing robust tests for PySpark code.

To view the docs for the PySpark test utils, see here. To see the code for the PySpark built-in test utils, check out the Spark repository here. To see the JIRA board tickets for the PySpark test framework, see here.

Build a PySpark Application
Here is an example of how to start a PySpark application. Feel free to skip to the next section, "Testing your PySpark Application," if you already have an application you're ready to test.

First, start your Spark Session. Next, create a DataFrame. Now, let's define and apply a transformation function to our DataFrame.
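A sketch of those three steps; the transformation function and column names are illustrative, not the guide's exact example:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, regexp_replace

    # First, start your Spark Session.
    spark = SparkSession.builder.appName("Testing PySpark Example").getOrCreate()

    # Next, create a DataFrame.
    sample_data = [{"name": "John    D.", "age": 30},
                   {"name": "Alice   G.", "age": 25}]
    df = spark.createDataFrame(sample_data)

    # Now, define and apply a transformation function to the DataFrame.
    def remove_extra_spaces(df, column_name):
        # Collapse runs of whitespace in the given string column.
        return df.withColumn(column_name, regexp_replace(col(column_name), "\\s+", " "))

    transformed_df = remove_extra_spaces(df, "name")
    transformed_df.show()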
testing your pyspark application
Now let's test our PySpark transformation function.

One option is to simply eyeball the resulting DataFrame. However, this can be impractical for large DataFrame or input sizes. A better way is to write tests. Here are some examples of how we can test our code. The examples below apply for Spark 3.5 and above versions.

Note that these examples are not exhaustive, as there are many other test framework alternatives which you can use instead of unittest or pytest. The built-in PySpark testing util functions are standalone, meaning they can be compatible with any test framework or CI test pipeline.

Option 1: Using Only PySpark Built-in Test Utility Functions

For simple ad-hoc validation cases, PySpark testing utils like assertDataFrameEqual and assertSchemaEqual can be used in a standalone context. You could easily test PySpark code in a notebook session. For example, say you want to assert equality between two DataFrames:
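A sketch of asserting DataFrame equality (the data is illustrative):

    from pyspark.testing.utils import assertDataFrameEqual

    df1 = spark.createDataFrame([("1", 1000), ("2", 3000)], schema=["id", "amount"])
    df2 = spark.createDataFrame([("1", 1000), ("2", 3000)], schema=["id", "amount"])
    assertDataFrameEqual(df1, df2)   # passes because the DataFrames are identical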

Now let’s write a unittest class.

Option 2: Using Unit Test

For more complex testing scenarios, you may want to use a testing framework. One of the most popular testing framework options is unittest. Let's walk through how you can use the built-in Python unittest library to write PySpark tests. For more information about the unittest library, see here: https://docs.python.org/3/library/unittest.html.

First, you will need a Spark session. You can use the @classmethod decorator from the unittest package to take care of setting up and tearing down a Spark session. Now let's write a unittest class.
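A sketch of a unittest class; it reuses the illustrative remove_extra_spaces function from the application sketch above:

    import unittest

    from pyspark.sql import SparkSession
    from pyspark.testing.utils import assertDataFrameEqual

    class PySparkTestCase(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            # Set up a Spark session once for the whole test class.
            cls.spark = SparkSession.builder.appName("Testing PySpark Example").getOrCreate()

        @classmethod
        def tearDownClass(cls):
            # Tear the session down after all tests have run.
            cls.spark.stop()

    class TestTransformation(PySparkTestCase):
        def test_single_space(self):
            sample_data = [{"name": "John    D.", "age": 30}]
            original_df = self.spark.createDataFrame(sample_data)

            transformed_df = remove_extra_spaces(original_df, "name")

            expected_df = self.spark.createDataFrame([{"name": "John D.", "age": 30}])
            assertDataFrameEqual(transformed_df, expected_df)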


When run, unittest will pick up all functions with a name beginning with "test."

Option 3: Using Pytest

We can also write our tests with pytest, which is one of the most popular Python testing frameworks. For more information about pytest, see the docs here: https://docs.pytest.org/en/7.1.x/contents.html.

Using a pytest fixture allows us to share a spark session across tests, tearing it down
when the tests are complete.
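A sketch of such a fixture:

    import pytest

    from pyspark.sql import SparkSession

    @pytest.fixture
    def spark_fixture():
        spark = SparkSession.builder.appName("Testing PySpark Example").getOrCreate()
        yield spark
        spark.stop()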

We can then define our tests like this:
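A sketch of a test that uses the fixture and the illustrative remove_extra_spaces function from earlier:

    from pyspark.testing.utils import assertDataFrameEqual

    def test_single_space(spark_fixture):
        sample_data = [{"name": "John    D.", "age": 30}]
        original_df = spark_fixture.createDataFrame(sample_data)

        transformed_df = remove_extra_spaces(original_df, "name")

        expected_df = spark_fixture.createDataFrame([{"name": "John D.", "age": 30}])
        assertDataFrameEqual(transformed_df, expected_df)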

When you run your test file with the pytest command, it will pick up all functions that have their name beginning with "test."

putting it all together!
Let's see all the steps together, in a Unit Test example.
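A compact, self-contained sketch combining the pieces above into one unittest file (the function and data are illustrative):

    import unittest

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, regexp_replace
    from pyspark.testing.utils import assertDataFrameEqual

    def remove_extra_spaces(df, column_name):
        # Collapse runs of whitespace in the given string column.
        return df.withColumn(column_name, regexp_replace(col(column_name), "\\s+", " "))

    class TestTransformation(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            cls.spark = SparkSession.builder.appName("Testing PySpark Example").getOrCreate()

        @classmethod
        def tearDownClass(cls):
            cls.spark.stop()

        def test_single_space(self):
            original_df = self.spark.createDataFrame([{"name": "John    D.", "age": 30}])
            transformed_df = remove_extra_spaces(original_df, "name")
            expected_df = self.spark.createDataFrame([{"name": "John D.", "age": 30}])
            assertDataFrameEqual(transformed_df, expected_df)

    if __name__ == "__main__":
        unittest.main()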