Converting a PySpark DataFrame Column to a Python List
Last Updated: 01 Dec, 2021
In this article, we will discuss how to convert a PySpark DataFrame column to a Python list.
Creating a dataframe for demonstration:
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "sravan", "vignan", 67, 89],
        ["2", "ojaswi", "vvit", 78, 89],
        ["3", "rohith", "vvit", 100, 80],
        ["4", "sridevi", "vignan", 78, 80],
        ["1", "sravan", "vignan", 89, 98],
        ["5", "gnanesh", "iit", 94, 98]]

# specify column names
columns = ['student ID', 'student NAME',
           'college', 'subject1', 'subject2']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display the dataframe
dataframe.show()
Output:

+----------+------------+-------+--------+--------+
|student ID|student NAME|college|subject1|subject2|
+----------+------------+-------+--------+--------+
|         1|      sravan| vignan|      67|      89|
|         2|      ojaswi|   vvit|      78|      89|
|         3|      rohith|   vvit|     100|      80|
|         4|     sridevi| vignan|      78|      80|
|         1|      sravan| vignan|      89|      98|
|         5|     gnanesh|    iit|      94|      98|
+----------+------------+-------+--------+--------+
Method 1: Using flatMap()
This method selects the column, converts it to an RDD, flattens each Row with flatMap(), and collects the resulting values into a list.
Syntax: dataframe.select('Column_Name').rdd.flatMap(lambda x: x).collect()
where,
- dataframe is the PySpark DataFrame
- Column_Name is the column to be converted into a list
- flatMap() is an RDD method that takes a lambda expression as a parameter and flattens each Row into its individual values
- collect() gathers the flattened values to the driver as a Python list
Example 1: Python code to convert a particular column to a list using flatMap()
Python3
# convert student NAME to a list using flatMap
print(dataframe.select('student NAME').
      rdd.flatMap(lambda x: x).collect())

# convert student ID to a list using flatMap
print(dataframe.select('student ID').
      rdd.flatMap(lambda x: x).collect())
Output:
['sravan', 'ojaswi', 'rohith', 'sridevi', 'sravan', 'gnanesh']
['1', '2', '3', '4', '1', '5']
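To make the flattening step concrete, here is a minimal sketch (it assumes only the dataframe created above) that reproduces the first list without flatMap(): collect() alone returns a list of Row objects, and each Row unpacks like a tuple of its values.
Python3
# collect() without flatMap() returns Row objects, one per row
rows = dataframe.select('student NAME').rdd.collect()

# each Row behaves like a tuple, so flattening the Rows in plain
# Python gives the same list that flatMap(lambda x: x) produces
print([value for row in rows for value in row])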
Example 2: Convert multiple columns to a list.
Python3
# convert multiple columns to a list using flatMap
print(dataframe.select(['student NAME',
                        'student NAME',
                        'college']).
      rdd.flatMap(lambda x: x).collect())
Output:
['sravan', 'sravan', 'vignan', 'ojaswi', 'ojaswi', 'vvit', 'rohith', 'rohith', 'vvit', 'sridevi', 'sridevi', 'vignan', 'sravan', 'sravan', 'vignan', 'gnanesh', 'gnanesh', 'iit']
Method 2: Using map()
This method converts the selected column to an RDD and maps each Row to its first value, collecting the results into a list.
Syntax: dataframe.select('Column_Name').rdd.map(lambda x : x[0]).collect()
where,
- dataframe is the PySpark DataFrame
- Column_Name is the column to be converted into a list
- map() is an RDD method that takes a lambda expression as a parameter and extracts the value from each Row
- collect() gathers the extracted values to the driver as a Python list
Example: Python code to convert a PySpark DataFrame column to a list using the map() function.
Python3
# convert student NAME to a list using map
print(dataframe.select('student NAME').
      rdd.map(lambda x: x[0]).collect())

# convert student ID to a list using map
print(dataframe.select('student ID').
      rdd.map(lambda x: x[0]).collect())

# convert college to a list using map
print(dataframe.select('college').
      rdd.map(lambda x: x[0]).collect())
Output:
['sravan', 'ojaswi', 'rohith', 'sridevi', 'sravan', 'gnanesh']
['1', '2', '3', '4', '1', '5']
['vignan', 'vvit', 'vvit', 'vignan', 'vignan', 'iit']
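A variation on the same map() pattern, sketched here under the assumption that the dataframe above is still in scope: when several columns are selected, mapping each Row to a tuple keeps the values grouped per row rather than producing one interleaved flat list.
Python3
# map each Row to a (name, college) tuple so the values stay
# grouped per row instead of being flattened together
pairs = (dataframe.select('student NAME', 'college').
         rdd.map(lambda x: (x[0], x[1])).collect())
print(pairs)
With the sample data this would print [('sravan', 'vignan'), ('ojaswi', 'vvit'), ('rohith', 'vvit'), ('sridevi', 'vignan'), ('sravan', 'vignan'), ('gnanesh', 'iit')].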
Method 3: Using collect()
collect() gathers the rows of the DataFrame to the driver; we then use a list comprehension to extract the column value from each Row.
Syntax: [data[0] for data in dataframe.select('column_name').collect()]
Where,
- dataframe is the PySpark DataFrame
- data is the loop variable that takes each collected Row in turn
- column_name is the column in the DataFrame
Example: Python code to convert DataFrame columns to lists using the collect() method
Python3
# display the college column as a list using a comprehension
print([data[0] for data in dataframe.
       select('college').collect()])

# display the student ID column as a list using a comprehension
print([data[0] for data in dataframe.
       select('student ID').collect()])

# display the subject1 column as a list using a comprehension
print([data[0] for data in dataframe.
       select('subject1').collect()])

# display the subject2 column as a list using a comprehension
print([data[0] for data in dataframe.
       select('subject2').collect()])
Output:
['vignan', 'vvit', 'vvit', 'vignan', 'vignan', 'iit']
['1', '2', '3', '4', '1', '5']
[67, 78, 100, 78, 89, 94]
[89, 89, 80, 80, 98, 98]
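Because collect() returns Row objects, the comprehension can also read fields by name instead of by position. An equivalent sketch, again assuming the same dataframe:
Python3
# Row objects support lookup by field name, so data['college']
# behaves the same as data[0] when only one column is selected
print([data['college'] for data in dataframe.
       select('college').collect()])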
Method 4: Using toLocalIterator()
This method returns an iterator over the rows of the DataFrame; we use a list comprehension to extract the column value from each Row as it is yielded.
Syntax: [data[0] for data in dataframe.select('column_name').toLocalIterator()]
Where,
- dataframe is the PySpark DataFrame
- data is the loop variable that takes each yielded Row in turn
- column_name is the column in the DataFrame
Example: Convert PySpark DataFrame columns to lists using the toLocalIterator() method
Python3
# display the college column as a list using a comprehension
print([data[0] for data in dataframe.
       select('college').toLocalIterator()])

# display the student ID column as a list using a comprehension
print([data[0] for data in dataframe.
       select('student ID').toLocalIterator()])

# display the subject1 column as a list using a comprehension
print([data[0] for data in dataframe.
       select('subject1').toLocalIterator()])

# display the subject2 column as a list using a comprehension
print([data[0] for data in dataframe.
       select('subject2').toLocalIterator()])
Output:
['vignan', 'vvit', 'vvit', 'vignan', 'vignan', 'iit']
['1', '2', '3', '4', '1', '5']
[67, 78, 100, 78, 89, 94]
[89, 89, 80, 80, 98, 98]
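Unlike collect(), toLocalIterator() fetches rows partition by partition, so the iterator can be consumed lazily without materialising the whole column on the driver. A small sketch of that, assuming the same dataframe, which reads only the first three values:
Python3
from itertools import islice

# build a lazy generator over the column values; rows are only
# fetched from the executors as the generator is consumed
values = (data[0] for data in dataframe.
          select('college').toLocalIterator())

# take just the first three values from the iterator
print(list(islice(values, 3)))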
Method 5: Using toPandas()
toPandas() converts the selected column to a pandas DataFrame, which we can then turn into a Python list.
Syntax: list(dataframe.select('column_name').toPandas()['column_name'])
Where,
- toPandas() converts the selected column to a pandas DataFrame
- column_name is the column in the PySpark DataFrame
Example: Convert PySpark DataFrame columns to lists using the toPandas() method
Python3
# display the college column as a list using toPandas
print(list(dataframe.select('college').
           toPandas()['college']))

# display the student NAME column as a list using toPandas
print(list(dataframe.select('student NAME').
           toPandas()['student NAME']))

# display the subject1 column as a list using toPandas
print(list(dataframe.select('subject1').
           toPandas()['subject1']))

# display the subject2 column as a list using toPandas
print(list(dataframe.select('subject2').
           toPandas()['subject2']))
Output:
['vignan', 'vvit', 'vvit', 'vignan', 'vignan', 'iit']
['sravan', 'ojaswi', 'rohith', 'sridevi', 'sravan', 'gnanesh']
[67, 78, 100, 78, 89, 94]
[89, 89, 80, 80, 98, 98]
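An equivalent sketch using pandas' own tolist() method on the resulting column (a pandas Series); note that toPandas() requires the pandas package on the driver and pulls the entire column into driver memory.
Python3
# toPandas() yields a pandas DataFrame; indexing it by column name
# gives a Series, whose tolist() returns a plain Python list
print(dataframe.select('college').toPandas()['college'].tolist())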