Split single column into multiple columns in PySpark DataFrame
Last Updated: 09 May, 2021
pyspark.sql.functions provide a function split() which is used to split DataFrame string Column into multiple columns.
Syntax: pyspark.sql.functions.split(str, pattern, limit=-1)
Parameters:
- str: the Column (or column name) to split.
- pattern: a str parameter holding a regular expression; it must be a valid Java regular expression.
- limit: an optional int parameter that, when specified, controls the number of times the pattern is applied (see the sketch below).
- limit > 0: the resulting array's length will be at most limit.
- limit <= 0: the pattern is applied as many times as possible, and the resulting array can be of any size.
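For instance, here is a minimal sketch (not from the original article; it assumes Spark 3.0 or later, where the limit argument became available in the Python API) showing how limit caps the number of resulting elements:
Python3
# a standalone sketch: 'splitdemo' is just an illustrative app name
from pyspark.sql import SparkSession
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("splitdemo").getOrCreate()
demo = spark.createDataFrame([('1991-04-01',)], ['DOB'])

# limit=2: the pattern is applied at most once, so the array
# has at most two elements: ['1991', '04-01']
demo.select(split(demo['DOB'], '-', 2).alias('parts')).show(truncate=False)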
First, let's create a DataFrame.
Python3
# installing pyspark
!pip install pyspark

# importing pyspark
import pyspark

# importing SparkSession
from pyspark.sql import SparkSession

# importing all from pyspark.sql.functions
from pyspark.sql.functions import *

# creating SparkSession object
spark = SparkSession.builder.appName("sparkdf").getOrCreate()

# creating the row data for the dataframe
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
        ('Milan', 'Sharma', '', '2000-05-19'),
        ('Rohit', 'Verma', 'M', '1978-09-05'),
        ('Maria', 'Anne', 'F', '1967-12-01'),
        ('Jay', 'Mehta', 'M', '1980-02-17')]

# giving the column names for the dataframe
columns = ['First Name', 'Last Name', 'Gender', 'DOB']

# creating the dataframe df
df = spark.createDataFrame(data, columns)

# printing the dataframe schema
df.printSchema()

# show dataframe
df.show()
Output:
DataFrame created
Example 1: Split column using withColumn()
In this example, we created a simple DataFrame whose 'DOB' column holds the date of birth as a string in yyyy-mm-dd format. Using split() together with withColumn(), the column is split into separate Year, Month, and Day columns.
Python3
# defining split() and its parameters
split_cols = pyspark.sql.functions.split(df['DOB'], '-')

# now applying split() using withColumn()
df1 = df.withColumn('Year', split_cols.getItem(0)) \
        .withColumn('Month', split_cols.getItem(1)) \
        .withColumn('Day', split_cols.getItem(2))

# show df1
df1.show()
Output:
Dataframe after splitting columns
Alternatively, we can write it as follows; it gives the same output:
Python3
# defining split() inline with withColumn()
df2 = df.withColumn('Year', split(df['DOB'], '-').getItem(0)) \
        .withColumn('Month', split(df['DOB'], '-').getItem(1)) \
        .withColumn('Day', split(df['DOB'], '-').getItem(2))

# show df2
df2.show()
Output:
In the above examples we used two parameters of split(): 'str', which holds the column to be split, and 'pattern', which holds the regular expression marking the positions at which the data is split.
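It can also help to look at what split() returns before getItem() is applied: a single ArrayType column. A minimal sketch, reusing the df defined above ('DOB_parts' is just an illustrative alias):
Python3
# split() yields one ArrayType column; getItem() then indexes into it
df.select(split(df['DOB'], '-').alias('DOB_parts')).show(truncate=False)
# each row of DOB_parts is an array such as [1991, 04, 01]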
Example 2: Split column using select()
In this example we recreate the same DataFrame df and split its 'DOB' column using .select():
Python3
# creating the row data for the dataframe
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
        ('Milan', 'Sharma', '', '2000-05-19'),
        ('Rohit', 'Verma', 'M', '1978-09-05'),
        ('Maria', 'Anne', 'F', '1967-12-01'),
        ('Jay', 'Mehta', 'M', '1980-02-17')]

# giving the column names for the dataframe
columns = ['First Name', 'Last Name', 'Gender', 'DOB']

# creating the dataframe df
df = spark.createDataFrame(data, columns)

# printing the dataframe schema
df.printSchema()

# show dataframe
df.show()

# defining split()
split_cols = pyspark.sql.functions.split(df['DOB'], '-')

# applying split() using select(); note that 'Gender' is left out
df3 = df.select('First Name', 'Last Name', 'DOB',
                split_cols.getItem(0).alias('year'),
                split_cols.getItem(1).alias('month'),
                split_cols.getItem(2).alias('day'))

# show df3
df3.show()
Output:

In the above example, we did not include the 'Gender' column in select(), so it does not appear in the resulting df3.
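If you want to keep every original column without listing each one, passing '*' to select() works as well. A minimal sketch, reusing split_cols from above (df4 is just an illustrative name):
Python3
# '*' keeps all existing columns; the split columns are appended after them
df4 = df.select('*',
                split_cols.getItem(0).alias('year'),
                split_cols.getItem(1).alias('month'),
                split_cols.getItem(2).alias('day'))
df4.show()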
Example 3: Splitting another string column
Python3
# creating the row data for the dataframe
data = [('Jaya', 'Sinha'), ('Milan', 'Soni'),
        ('Rohit', 'Verma'), ('Maria', 'Anne'),
        ('Jay', 'Mehta')]

# giving the column names for the dataframe
columns = ['First Name', 'Last Name']

# creating the dataframe df
df = spark.createDataFrame(data, columns)

# printing the dataframe schema
df.printSchema()

# show dataframe
df.show()

# defining split() with an empty pattern to split into single characters
split_cols = pyspark.sql.functions.split(df['Last Name'], '')

# applying split() using .withColumn()
df = df.withColumn('1', split_cols.getItem(0)) \
       .withColumn('2', split_cols.getItem(1)) \
       .withColumn('3', split_cols.getItem(2)) \
       .withColumn('4', split_cols.getItem(3)) \
       .withColumn('5', split_cols.getItem(4))

# show df
df.show()
Output:

In the above example, we took only two columns, First Name and Last Name, and split the Last Name values into single characters spread across multiple columns.
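Note that getItem() past the end of the array yields null, so last names shorter than five characters leave the trailing columns empty. If you need the per-row element count, pyspark.sql.functions.size() returns it. A minimal sketch ('Name Length' is just an illustrative column name, and depending on the Spark version the count may include a trailing empty string when splitting on an empty pattern):
Python3
# size() counts the elements in each split array
df = df.withColumn('Name Length', pyspark.sql.functions.size(split_cols))
df.show()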