In this article, we are going to learn how to apply a transformation to multiple columns in a data frame using Pyspark in Python.
PySpark is the Python API for Apache Spark; it lets you work with distributed data frames from Python in a style that will feel familiar if you have used libraries such as Pandas or Scikit-learn. While using PySpark, you may have needed to apply the same function, whether it is uppercase, lowercase, subtraction, addition, etc., to multiple columns at once. PySpark offers more than one way to do this. In this article, we will discuss several ways to apply a transformation to multiple columns of a PySpark data frame.
Methods to apply a transformation to multiple columns of the PySpark data frame:
Method 1: Using reduce function
Python's built-in functools.reduce function applies a two-argument function cumulatively to the items of a sequence, reducing it to a single value (for example, a minimum, a maximum, or a total). In this method, we will import the CSV file or create the data frame and then use reduce to apply the same transformation to the multiple columns of the uploaded or created data frame.
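As a quick standalone illustration (with made-up numbers, unrelated to the data frame below), this is how functools.reduce folds a two-argument function over a sequence from left to right:
from functools import reduce

# evaluated as ((1 + 2) + 3) + 4
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4])
print(total)  # 10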
Stepwise Implementation:
Step 1: First, import the required libraries, i.e. SparkSession, reduce, col, and upper. SparkSession is used to create the session, reduce applies a given function cumulatively to the elements of a sequence, col refers to a column by name, and upper converts text to upper case. Instead of upper, you can use any other column function you want to apply to each column of the data frame.
from pyspark.sql import SparkSession
from functools import reduce
from pyspark.sql.functions import col, upper
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file or create the data frame using the createDataFrame function.
data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)
or
data_frame = spark_session.createDataFrame([(row_1_data), (row_2_data), (row_3_data)],
                                           ['column_name_1', 'column_name_2', 'column_name_3'])
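For instance, a tiny data frame could be created in place of the CSV file like this (the column names 'name' and 'subject' match the example further below; the row values are made up for illustration):
data_frame = spark_session.createDataFrame(
    [('alice', 'maths'), ('bob', 'physics')],
    ['name', 'subject'])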
Step 4: Next, use reduce to apply the chosen function to every column of the data frame: reduce calls withColumn once per column name, threading the data frame through as the accumulator.
updated_data_frame = reduce(
    lambda traverse_df, col_name: traverse_df.withColumn(col_name, upper(col(col_name))),
    data_frame.columns, data_frame)
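To make the mechanics concrete: if the data frame had only the two columns 'name' and 'subject', the reduce call above would unroll into the equivalent chain of withColumn calls (a sketch for clarity only):
updated_data_frame = (data_frame
                      .withColumn('name', upper(col('name')))
                      .withColumn('subject', upper(col('subject'))))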
Step 5: Finally, display the data frame updated in the previous step.
updated_data_frame.show()
Example:
In this example, we have uploaded the CSV file (link), i.e., a 5*5 data set, as follows:
Then, we used the reduce function to apply the upper function to multiple columns, 'name' and 'subject', of the PySpark data frame, converting their values to uppercase.
Python3
from pyspark.sql import SparkSession
from functools import reduce
from pyspark.sql.functions import col, upper

# create the Spark session
spark_session = SparkSession.builder.getOrCreate()

# read the CSV file into a data frame
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# apply upper to every column, threading the data frame through reduce
updated_data_frame = reduce(
    lambda traverse_df, col_name: traverse_df.withColumn(col_name, upper(col(col_name))),
    data_frame.columns, data_frame)

updated_data_frame.show()
Output:
Method 2: Using for loop
A for loop is a way of iterating over a sequence (a list, a tuple, a dictionary, a set, or a string). In this method, we will import the CSV file or create the data frame and then use a for loop to apply a transformation to the multiple columns of the uploaded or created data frame.
Stepwise Implementation
Step 1: First, import the required libraries, i.e. SparkSession, col, and upper. SparkSession is used to create the session, col refers to a column by name, and upper converts text to upper case. Instead of upper, you can use any other column function you want to apply to each column of the data frame.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file or create the data frame using the createDataFrame function.
data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)
or
data_frame = spark_session.createDataFrame([(row_1_data), (row_2_data), (row_3_data)],
                                           ['column_name_1', 'column_name_2', 'column_name_3'])
Step 4: Next, use a for loop to traverse all the columns and convert each one to uppercase, reassigning the data frame on every iteration.
for col_name in data_frame.columns:
    data_frame = data_frame.withColumn(col_name, upper(col(col_name)))
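The same loop also works when the transformation should only touch some of the columns; for example, it can be restricted to the string columns reported by data_frame.dtypes (a sketch, assuming the data frame mixes string and numeric columns):
# keep numeric columns untouched, uppercase only the string ones
string_cols = [name for name, dtype in data_frame.dtypes if dtype == 'string']
for col_name in string_cols:
    data_frame = data_frame.withColumn(col_name, upper(col(col_name)))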
Step 5: Finally, display the data frame updated in the previous step.
data_frame.show()
Example:
In this example, we have uploaded the CSV file (link), i.e., a 5*5 data set, as follows:
Then, we used the for loop to apply the upper function to multiple columns, 'name' and 'subject', of the PySpark data frame, converting their values to uppercase.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

# create the Spark session
spark_session = SparkSession.builder.getOrCreate()

# read the CSV file into a data frame
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# apply upper to every column, one withColumn call per column
for col_name in data_frame.columns:
    data_frame = data_frame.withColumn(col_name, upper(col(col_name)))

data_frame.show()
Output:
Method 3: Using list comprehension
A list comprehension is a concise way of creating a new list from the values of an existing one. In this method, we will import the CSV file or create the data frame and then use a list comprehension inside select to apply a transformation to the multiple columns of the uploaded or created data frame.
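As a quick plain-Python refresher (unrelated to Spark), a list comprehension builds a new list in a single expression:
# produces ['NAME', 'SUBJECT']
[s.upper() for s in ['name', 'subject']]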
Stepwise Implementation:
Step 1: First, import the required libraries, i.e. SparkSession, col, and upper. SparkSession is used to create the session, col refers to a column by name, and upper converts text to upper case. Instead of upper, you can use any other column function you want to apply to each column of the data frame.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file or create the data frame using the createDataFrame function.
data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)
or
data_frame = spark_session.createDataFrame([(row_1_data), (row_2_data), (row_3_data)],
                                           ['column_name_1', 'column_name_2', 'column_name_3'])
Step 4: Next, use a list comprehension to build an upper-cased expression for every column and pass them all to select.
updated_data_frame = data_frame.select(
    *[upper(col(col_name)).name(col_name) for col_name in data_frame.columns])
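Here the comprehension builds one Column expression per column, and select receives them unpacked; Column.name() is simply another name for alias(). If only some columns should be transformed, the untouched ones can be passed through as-is, for example (a sketch assuming the 'name' and 'subject' columns of the example data set):
updated_data_frame = data_frame.select(
    *[upper(col(c)).alias(c) if c in ('name', 'subject') else col(c)
      for c in data_frame.columns])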
Step 5: Finally, display the data frame updated in the previous step.
updated_data_frame.show()
Example:
In this example, we have uploaded the CSV file (link), i.e., a 5*5 data set, as follows:
Then, we used a list comprehension to apply the upper function to multiple columns, 'name' and 'subject', of the PySpark data frame, converting their values to uppercase.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

# create the Spark session
spark_session = SparkSession.builder.getOrCreate()

# read the CSV file into a data frame
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# build an upper-cased expression for every column and select them all at once
updated_data_frame = data_frame.select(
    *[upper(col(col_name)).name(col_name) for col_name in data_frame.columns])

updated_data_frame.show()
Output:
