How to Add Multiple Columns to PySpark DataFrames?
In this article, we will see different ways of adding multiple columns to PySpark DataFrames.
Let's create a sample DataFrame for demonstration:
Dataset Used: Cricket_data_set_odi
# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# create a DataFrame from the CSV file, treating the first row as a header
df = spark.read.option("header", True).csv("Cricket_data_set_odi.csv")

# display the schema
df.printSchema()

# show the DataFrame
df.show()
Output:

Method 1: Using withColumn()
withColumn() is used to add a new column or update an existing column on a DataFrame.
Syntax: df.withColumn(colName, col)
Returns: a new DataFrame with the column added, or the existing column of the same name replaced.
Code:
df.withColumn('Avg_runs', df.Runs / df.Matches) \
  .withColumn('wkt+10', df.Wickets + 10).show()
Output:

Method 2: Using select()
You can also add multiple columns using select().
Syntax: df.select(*cols)
Code:
# using select() to add multiple columns
df.select('*', (df.Runs / df.Matches).alias('Avg_runs'),
          (df.Wickets + 10).alias('wkt+10')).show()
Output:

Method 3: Adding Multiple Constant Columns to a DataFrame Using withColumn() and select()
Let's create new columns with constant values using the lit() SQL function in the code below. The lit() function in PySpark adds a new column to a DataFrame by assigning it a constant or literal value.
from pyspark.sql.functions import col, lit
df.select('*', lit("Cricket").alias("Sport")) \
  .withColumn("Fitness", lit("Good")).show()
Output:
