Split a List to Multiple Columns in Pyspark
Last Updated: 26 Apr, 2025
Have you ever been stuck with the data of several columns packed into a single column and been unsure how to split that dataset? This can be achieved in Pyspark in several ways, and in this article we discuss them.
Modules Required:
Pyspark: PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for large-scale data processing. It can be installed with the following command:
pip install pyspark
Methods to split a list into multiple columns in Pyspark:
- Using expr in a list comprehension
- Splitting data frame row-wise and appending in columns
- Splitting data frame column-wise
Method 1: Using expr in a list comprehension
Step 1: First of all, import the required libraries, i.e. SparkSession, expr, and the Pyspark data types. SparkSession is used to create the session, expr is a SQL function used to execute SQL-like expressions, and pyspark.sql.types provides the data types needed to define the schema.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.types import *
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, define the schema for creating the data frame with an array-typed column.
mySchema = StructType([StructField("Heading", StringType(), True),
                       StructField("Column", ArrayType(IntegerType(), True))])
Step 4: Later on, create the data frame that needs to be split into multiple columns.
data_frame = spark_session.createDataFrame([['column_heading1', [column1_data]],
                                            ['column_heading2', [column2_data]]],
                                            schema=mySchema)
Step 5: Finally, split the list into columns using the expr() function inside a list comprehension.
data_frame.select([expr('Column[' + str(x) + ']') for x in range(0, number_of_columns)]).show()
Example:
In this example, we first define the schema for the data frame and then create the data frame from a list of data using that schema. Finally, we split the array column into separate columns using the expr() function inside a list comprehension.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.types import *
spark_session = SparkSession.builder.getOrCreate()
# Define the schema with a string column and an array-typed column
mySchema = StructType([StructField("Heading", StringType(), True),
                       StructField("Column", ArrayType(IntegerType(), True))])

# Create the data frame whose array column will be split
data_frame = spark_session.createDataFrame(
    [['A', [1, 2, 3]], ['B', [4, 5, 6]], ['C', [7, 8, 9]]],
    schema=mySchema)

# Split the array column into one column per element
data_frame.select([expr('Column[' + str(x) + ']') for x in range(0, 3)]).show()
Output:
+---------+---------+---------+
|Column[0]|Column[1]|Column[2]|
+---------+---------+---------+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+---------+---------+---------+
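If you prefer readable column names instead of the default Column[0], Column[1], ..., you can alias each expression, and you can derive the number of columns from the array itself rather than hard-coding it. The following is a minimal sketch that assumes every array in the column has the same length; the Col_1, Col_2, ... names are purely illustrative.
from pyspark.sql.functions import expr, size

# Number of elements in the array column, taken from the first row
# (assumes all arrays have the same length)
n_cols = data_frame.select(size("Column")).first()[0]

# Alias each extracted element with a readable name (Col_1, Col_2, ...)
data_frame.select([expr('Column[' + str(x) + ']').alias('Col_' + str(x + 1))
                   for x in range(n_cols)]).show()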
Method 2: Splitting data frame row-wise and appending in columns
Step 1: First of all, import the required libraries, i.e. SparkSession, Row, and col. SparkSession is used to create the session, Row represents a row of the data frame, and col is used to refer to a column of the data frame.
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import col
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, declare an array that you need to split into multiple columns.
arr=[[row1_data],[row2_data],[row3_data]]
Step 4: Later on, create the number of rows in the data frame.
data_frame = spark_session.createDataFrame([Row(index=1, finalArray=arr[0]),
                                            Row(index=2, finalArray=arr[1]),
                                            Row(index=3, finalArray=arr[2])])
Step 5: Finally, append the columns to the data frame.
data_frame.select([(col("finalArray")[x]).alias("Column "+str(x+1)) for x in range(0, 3)]).show()
Example:
In this example, we declare a list, build a data frame from it row by row, and then split each row's array into separate columns for display.
Python3
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import col
spark_session = SparkSession.builder.getOrCreate()
# List whose inner lists become the rows of the data frame
arr = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Build the data frame row-wise, one Row per inner list
data_frame = spark_session.createDataFrame([Row(index=1, finalArray=arr[0]),
                                            Row(index=2, finalArray=arr[1]),
                                            Row(index=3, finalArray=arr[2])])

# Split the array column and give each new column a readable name
data_frame.select([(col("finalArray")[x]).alias("Column " + str(x + 1))
                   for x in range(0, 3)]).show()
Output:
+--------+--------+--------+
|Column 1|Column 2|Column 3|
+--------+--------+--------+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+--------+--------+--------+
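If you also want to keep the original index column next to the split-out values, you can pass it to select() along with the unpacked comprehension. This is a minimal sketch reusing the data frame from the example above.
# Keep the index column and unpack the comprehension with *
data_frame.select("index",
                  *[(col("finalArray")[x]).alias("Column " + str(x + 1))
                    for x in range(0, 3)]).show()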
Method 3: Splitting data frame column-wise
Step 1: First of all, import the required library, i.e. SparkSession. SparkSession is used to create the session.
from pyspark.sql import SparkSession
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, create a spark context.
sc=spark_session.sparkContext
Step 4: Later on, create the data frame that needs to be split into multiple columns.
data_frame = spark_session.createDataFrame(
    sc.parallelize([['column_heading1', [column1_data]],
                    ['column_heading2', [column2_data]]]),
    ["key", "value"])
Step 5: Finally, split the data frame column-wise.
data_frame.select("key", data_frame.value[0], data_frame.value[1], data_frame.value[2]).show()
Example:
In this example, we parallelize the list through the Spark context, create a data frame from it, and then split the array column into multiple columns for display.
Python3
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.getOrCreate()
sc = spark_session.sparkContext
# Build the data frame from a parallelized list of (key, array) pairs
data_frame = spark_session.createDataFrame(
    sc.parallelize([['Column 1', [1, 2, 3]],
                    ['Column 2', [4, 5, 6]],
                    ['Column 3', [7, 8, 9]]]), ["key", "value"])

# Select the key along with each element of the array column
data_frame.select("key", data_frame.value[0], data_frame.value[1],
                  data_frame.value[2]).show()
Output:
+--------+--------+--------+--------+
| key|value[0]|value[1]|value[2]|
+--------+--------+--------+--------+
|Column 1| 1| 2| 3|
|Column 2| 4| 5| 6|
|Column 3| 7| 8| 9|
+--------+--------+--------+--------+
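When the number of elements is known, the same selection can be written with a list comprehension instead of spelling out value[0], value[1], value[2] by hand. This is a minimal sketch that assumes every array in the value column has exactly three elements.
# Build the element selections in a loop and unpack them with *
data_frame.select("key", *[data_frame.value[i] for i in range(3)]).show()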