Split single column into multiple columns in PySpark DataFrame
Last Updated: 09 May, 2021
pyspark.sql.functions provide a function split() which is used to split DataFrame string Column into multiple columns.
Syntax: pyspark.sql.functions.split(str, pattern, limit=-1)
Parameters:
- str: the Column (or column name) to split.
- pattern: a str parameter holding a regular expression; it must be a valid Java regular expression.
- limit: an optional int parameter that, when specified, controls the number of times the pattern is applied (see the sketch below).
- limit > 0: the resulting array's length will be at most limit.
- limit <= 0: the pattern is applied as many times as possible, and the resulting array can be of any size.
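For instance, here is a minimal sketch (not from the original article; it assumes Spark 3.0 or later, where the limit argument became available in the Python API) showing how limit caps the number of resulting elements:
Python3
# a standalone sketch: 'splitdemo' is just an illustrative app name
from pyspark.sql import SparkSession
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("splitdemo").getOrCreate()
demo = spark.createDataFrame([('1991-04-01',)], ['DOB'])

# limit=2: the pattern is applied at most once, so the array
# has at most two elements: ['1991', '04-01']
demo.select(split(demo['DOB'], '-', 2).alias('parts')).show(truncate=False)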
First, let's create a DataFrame.
Python3
# installing pyspark
!pip install pyspark

# importing pyspark
import pyspark

# importing SparkSession
from pyspark.sql import SparkSession

# importing all from pyspark.sql.functions
from pyspark.sql.functions import *

# creating SparkSession object
spark = SparkSession.builder.appName("sparkdf").getOrCreate()

# creating the row data for the dataframe
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
        ('Milan', 'Sharma', '', '2000-05-19'),
        ('Rohit', 'Verma', 'M', '1978-09-05'),
        ('Maria', 'Anne', 'F', '1967-12-01'),
        ('Jay', 'Mehta', 'M', '1980-02-17')]

# giving the column names for the dataframe
columns = ['First Name', 'Last Name', 'Gender', 'DOB']

# creating the dataframe df
df = spark.createDataFrame(data, columns)

# printing the dataframe schema
df.printSchema()

# show dataframe
df.show()
Output:
DataFrame created
Example 1: Split column using withColumn()
In this example, we created a simple DataFrame whose 'DOB' column holds the date of birth as a string in yyyy-mm-dd format. Using split() together with withColumn(), the column is split into separate Year, Month, and Day columns.
Python3
# defining split() and its parameters
split_cols = pyspark.sql.functions.split(df['DOB'], '-')

# now applying split() using withColumn()
df1 = df.withColumn('Year', split_cols.getItem(0)) \
        .withColumn('Month', split_cols.getItem(1)) \
        .withColumn('Day', split_cols.getItem(2))

# show df1
df1.show()
Output:
Dataframe after splitting columns
Alternatively, we can write it as follows; it gives the same output:
Python3
# defining split() inline with withColumn()
df2 = df.withColumn('Year', split(df['DOB'], '-').getItem(0)) \
        .withColumn('Month', split(df['DOB'], '-').getItem(1)) \
        .withColumn('Day', split(df['DOB'], '-').getItem(2))

# show df2
df2.show()
Output:
In the above examples we used two parameters of split(): 'str', which holds the column to be split, and 'pattern', which holds the regular expression marking the positions at which the data is split.
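It can also help to look at what split() returns before getItem() is applied: a single ArrayType column. A minimal sketch, reusing the df defined above ('DOB_parts' is just an illustrative alias):
Python3
# split() yields one ArrayType column; getItem() then indexes into it
df.select(split(df['DOB'], '-').alias('DOB_parts')).show(truncate=False)
# each row of DOB_parts is an array such as [1991, 04, 01]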
Example 2: Split column using select()
In this example we recreate the same DataFrame df and split its 'DOB' column using .select():
Python3
# creating the row data for the dataframe
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
        ('Milan', 'Sharma', '', '2000-05-19'),
        ('Rohit', 'Verma', 'M', '1978-09-05'),
        ('Maria', 'Anne', 'F', '1967-12-01'),
        ('Jay', 'Mehta', 'M', '1980-02-17')]

# giving the column names for the dataframe
columns = ['First Name', 'Last Name', 'Gender', 'DOB']

# creating the dataframe df
df = spark.createDataFrame(data, columns)

# printing the dataframe schema
df.printSchema()

# show dataframe
df.show()

# defining split()
split_cols = pyspark.sql.functions.split(df['DOB'], '-')

# applying split() using select(); note that 'Gender' is left out
df3 = df.select('First Name', 'Last Name', 'DOB',
                split_cols.getItem(0).alias('year'),
                split_cols.getItem(1).alias('month'),
                split_cols.getItem(2).alias('day'))

# show df3
df3.show()
Output:

In the above example, we did not include the 'Gender' column in select(), so it does not appear in the resulting df3.
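If you want to keep every original column without listing each one, passing '*' to select() works as well. A minimal sketch, reusing split_cols from above (df4 is just an illustrative name):
Python3
# '*' keeps all existing columns; the split columns are appended after them
df4 = df.select('*',
                split_cols.getItem(0).alias('year'),
                split_cols.getItem(1).alias('month'),
                split_cols.getItem(2).alias('day'))
df4.show()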
Example 3: Splitting another string column
Python3
# creating the row data for the dataframe
data = [('Jaya', 'Sinha'), ('Milan', 'Soni'),
        ('Rohit', 'Verma'), ('Maria', 'Anne'),
        ('Jay', 'Mehta')]

# giving the column names for the dataframe
columns = ['First Name', 'Last Name']

# creating the dataframe df
df = spark.createDataFrame(data, columns)

# printing the dataframe schema
df.printSchema()

# show dataframe
df.show()

# defining split() with an empty pattern to split into single characters
split_cols = pyspark.sql.functions.split(df['Last Name'], '')

# applying split() using .withColumn()
df = df.withColumn('1', split_cols.getItem(0)) \
       .withColumn('2', split_cols.getItem(1)) \
       .withColumn('3', split_cols.getItem(2)) \
       .withColumn('4', split_cols.getItem(3)) \
       .withColumn('5', split_cols.getItem(4))

# show df
df.show()
Output:

In the above example, we took only two columns, First Name and Last Name, and split the Last Name values into single characters spread across multiple columns.
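Note that getItem() past the end of the array yields null, so last names shorter than five characters leave the trailing columns empty. If you need the per-row element count, pyspark.sql.functions.size() returns it. A minimal sketch ('Name Length' is just an illustrative column name, and depending on the Spark version the count may include a trailing empty string when splitting on an empty pattern):
Python3
# size() counts the elements in each split array
df = df.withColumn('Name Length', pyspark.sql.functions.size(split_cols))
df.show()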