Spark - Split array to separate columns
Last Updated: 28 Apr, 2025
Apache Spark is a powerful big data processing engine that can analyze enormous amounts of data in parallel across distributed clusters of computers. PySpark is the Python interface to Apache Spark; it lets Python programmers create Spark applications more quickly and easily.
Method 1: Using the Function split()
In this example, the split function is first imported from the pyspark.sql.functions module and a SparkSession is created. Next, a PySpark DataFrame is built with two columns, "id" and "fruits", and two rows holding the values "1, apple,orange,banana" and "2, grape,kiwi,peach". The split function splits the "fruits" column into an array of strings, which is stored in a new column "fruit_list". A second DataFrame, df3, is then created by selecting the first three elements of the "fruit_list" array with getItem and aliasing them as "fruit1", "fruit2", and "fruit3". Finally, show is called on both DataFrames to display the output.
Syntax: split(str: Column, pattern: str, limit: int = -1) -> Column
The split method returns a new PySpark Column object that represents an array of strings. Each element in the array is a substring of the original column that was split using the specified pattern.
The split method takes two required parameters and one optional parameter:
- str: The PySpark column to split. This can be a string column, a column expression, or a column name.
- pattern: The string or regular expression to split the column on.
- limit: (optional) An integer controlling how many times the pattern is applied. The default of -1 means the pattern is applied as many times as possible.
Python3
# importing packages
from pyspark.sql.functions import split
from pyspark.sql import SparkSession
# creating a spark session
spark = SparkSession.builder.appName('split_array_to_columns').getOrCreate()
# Create a PySpark DataFrame
df = spark.createDataFrame([(1, "apple,orange,banana"),
                            (2, "grape,kiwi,peach")], ["id", "fruits"])
# Split the "fruits" column into an array of strings
df = df.withColumn("fruit_list", split(df.fruits, ","))
# Show the resulting DataFrame
df.show()
df3 = df.select(
    df.fruit_list.getItem(0).alias('fruit1'),
    df.fruit_list.getItem(1).alias('fruit2'),
    df.fruit_list.getItem(2).alias('fruit3')
)
df3.show()
Output:
Method 2: Using the Function getItem()
In this example, first, let's create a data frame that has two columns "id" and "fruits". To split the fruits array column into separate columns, we use the PySpark getItem() function along with the col() function to create a new column for each fruit element in the array. The getItem() function is a PySpark SQL function that allows you to extract a single element from an array column in a DataFrame. It takes an integer index as a parameter and returns the element at that index in the array.
Syntax: getItem(index)
The index parameter specifies the index of the element to extract from the array column. The index can be an integer or a column expression that evaluates to an integer.
Python3
# importing packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# creating a spark session
spark = SparkSession.builder.appName('split_array_to_columns').getOrCreate()
data = [(1, ['apple', 'banana', 'orange']),
        (2, ['grape', 'kiwi', 'pineapple', 'watermelon']),
        (3, ['peach', 'pear'])]
# creating a dataframe
df = spark.createDataFrame(data, ['id', 'fruits'])
df.show()
# splitting the fruits column into multiple columns;
# rows with fewer than 4 fruits get null in the extra columns
df = df.select('id',
               *[col('fruits').getItem(i).alias(f'fruit{i + 1}') for i in range(4)])
df.show()
df.printSchema()
Output: