Spark Trim String Column on DataFrame
In this article, we will see how to remove white spaces from a DataFrame string column in PySpark. We will perform the same operation as SQL's trim() (which removes leading and trailing white spaces) in PySpark itself.
PySpark Trim String Column on DataFrame
Below are the ways by which we can trim String Column on DataFrame in PySpark:
- Using withColumn with rtrim()
- Using withColumn with trim()
- Using select()
- Using SQL Expression
- Using PySpark trim(), rtrim(), ltrim()
PySpark Trim using withColumn() with rtrim()
In this example, we use withColumn() with rtrim() to remove the trailing (right) white spaces.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim, col
# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()
# Define your data
data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]
# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])
# Show the initial DataFrame
df.show()
# Using withColumn to remove white spaces
df = df.withColumn("col2", rtrim(col("col2")))
df.show()
# Stop the Spark session
spark.stop()
+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
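Note that show() right-aligns cell values, so the leading space that rtrim() leaves behind (as in " DEF") is easy to miss in the printed table. Below is a minimal sketch, not part of the original example, that uses the built-in length() function to make the remaining spaces visible:
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import length, rtrim

# Create a Spark session
spark = SparkSession.builder.appName("RtrimLengthCheck").getOrCreate()
df = spark.createDataFrame([(1, "ABC "), (2, " DEF"), (3, " GHI ")], ["col1", "col2"])

# length() counts every character, so it exposes the white spaces that the
# right-aligned show() output hides; rtrim() removes only trailing spaces,
# so " DEF" keeps its leading space and its length stays 4
df.select(
    "col2",
    length("col2").alias("len_before"),
    length(rtrim("col2")).alias("len_after"),
).show()

spark.stop()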
PySpark Trim using withColumn() with trim()
In this example, we use withColumn() with trim() to remove both leading and trailing white spaces.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim, col
# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()
# Define your data
data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]
# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])
# Show the initial DataFrame
df.show()
# Using withColumn to remove white spaces
df = df.withColumn("col2", trim(col("col2")))
df.show()
# Stop the Spark session
spark.stop()
+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
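One caveat: Spark's trim(), ltrim(), and rtrim() strip only the space character, not tabs or newlines. If the column may contain other whitespace, a common workaround (shown below as a sketch, not part of the original example) is regexp_replace() with an anchored \s pattern:
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, col

# Create a Spark session
spark = SparkSession.builder.appName("WhitespaceRegex").getOrCreate()
df = spark.createDataFrame([(1, "\tABC "), (2, " DEF\n")], ["col1", "col2"])

# ^\s+ matches leading whitespace and \s+$ matches trailing whitespace,
# covering spaces, tabs, and newlines in one pass
df = df.withColumn("col2", regexp_replace(col("col2"), r"^\s+|\s+$", ""))
df.show()

spark.stop()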
PySpark Trim using select() to Remove White Spaces
In this example, we use select() with trim() and alias() to trim the column in PySpark while keeping its original name.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim, col
# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()
# Define your data
data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]
# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])
# Show the initial DataFrame
df.show()
# Using select to remove white spaces
df = df.select(col("col1"), trim(col("col2")).alias("col2"))
df.show()
# Stop the Spark session
spark.stop()
+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
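Because select() rebuilds the whole projection in one call, it also lends itself to trimming every string column at once. The following sketch is an extension of the example above, not part of the original; it uses df.dtypes to find the string columns:
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, col

# Create a Spark session
spark = SparkSession.builder.appName("TrimAllStringColumns").getOrCreate()
df = spark.createDataFrame(
    [(1, "ABC ", " x "), (2, " DEF", " y ")], ["col1", "col2", "col3"]
)

# df.dtypes is a list of (column_name, type_name) pairs;
# trim the string columns and pass the rest through unchanged
df = df.select(
    *[trim(col(c)).alias(c) if t == "string" else col(c) for c, t in df.dtypes]
)
df.show()

spark.stop()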
Using SQL Expression
In this example, we register the DataFrame as a temporary view and use a SQL expression with TRIM() to remove the white spaces.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim, col
# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()
# Define your data
data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]
# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])
# Show the initial DataFrame
df.show()
# Using SQL Expression to remove white spaces
df.createOrReplaceTempView("TAB")
spark.sql("SELECT col1, TRIM(col2) AS col2 FROM TAB").show()
# Stop the Spark session
spark.stop()
+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
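If you would rather not register a temporary view, selectExpr() accepts the same SQL snippet directly. A minimal sketch of this alternative, using the same data as above:
Python3
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("SelectExprTrim").getOrCreate()
df = spark.createDataFrame([(1, "ABC "), (2, " DEF"), (3, " GHI ")], ["col1", "col2"])

# selectExpr() evaluates SQL expressions without needing a temp view
df.selectExpr("col1", "TRIM(col2) AS col2").show()

spark.stop()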
Using PySpark trim(), rtrim(), ltrim()
In PySpark, we can remove leading and trailing white spaces with the pyspark.sql.functions.trim() function. Use ltrim() to remove only the leading (left) white spaces and rtrim() to remove only the trailing (right) white spaces. The example below runs all of the earlier approaches on one DataFrame; the output tables that follow appear in the same order as the show() calls.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim

# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()

data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]
df = spark.createDataFrame(data=data, schema=["col1", "col2"])
df.show()

# using withColumn and trim()
df.withColumn("col2", trim("col2")).show()

# using ltrim()
df.withColumn("col2", ltrim("col2")).show()

# using rtrim()
df.withColumn("col2", rtrim("col2")).show()

# Using select
df.select("col1", trim("col2").alias("col2")).show()

# Using SQL Expression
df.createOrReplaceTempView("TAB")
spark.sql("select col1, trim(col2) as col2 from TAB").show()

# Stop the Spark session
spark.stop()
+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+

+----+----+
|col1|col2|
+----+----+
|   1|ABC |
|   2| DEF|
|   3|GHI |
+----+----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+

+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
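Finally, the SQL form of TRIM is more flexible than the Python functions shown above: it accepts an explicit direction (BOTH, LEADING, TRAILING) and a custom trim string. The sketch below, which is not part of the original examples, uses hypothetical '#'-padded data to illustrate:
Python3
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("SqlTrimVariants").getOrCreate()
df = spark.createDataFrame([(1, "##ABC##"), (2, "##DEF")], ["col1", "col2"])
df.createOrReplaceTempView("TAB")

# TRIM([BOTH | LEADING | TRAILING] trimStr FROM str) removes the given
# character(s) instead of the default space
spark.sql("""
    SELECT col1,
           TRIM(BOTH '#' FROM col2)     AS both_trimmed,
           TRIM(LEADING '#' FROM col2)  AS leading_trimmed,
           TRIM(TRAILING '#' FROM col2) AS trailing_trimmed
    FROM TAB
""").show()

spark.stop()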