Add new column with default value in PySpark dataframe
Last Updated: 29 Jun, 2021
In this article, we are going to see how to add a new column with a default value in PySpark Dataframe.
There are three ways to add a column with a default value to a PySpark DataFrame:
- Using pyspark.sql.DataFrame.withColumn(colName, col)
- Using pyspark.sql.DataFrame.select(*cols)
- Using pyspark.sql.SparkSession.sql(sqlQuery)
Method 1: Using pyspark.sql.DataFrame.withColumn(colName, col)
It adds a column, or replaces an existing column that has the same name, and returns a new DataFrame containing all existing columns plus the new one. The column expression must be an expression over this DataFrame; trying to add a column from some other DataFrame will raise an error.
Syntax: pyspark.sql.DataFrame.withColumn(colName, col)
Parameters: This method accepts the following parameters.
- colName: It is a string and contains the name of the new column.
- col: It is a Column expression for the new column.
Returns: DataFrame
First, create a simple DataFrame.
Python3
import findspark
findspark.init()
# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession
# creating the session
spark = SparkSession.builder.getOrCreate()
# creating the dataframe
pandas_df = pd.DataFrame({
'Name': ['Anurag', 'Manjeet', 'Shubham',
'Saurabh', 'Ujjawal'],
'Address': ['Patna', 'Delhi', 'Coimbatore',
'Greater noida', 'Patna'],
'ID': [20123, 20124, 20145, 20146, 20147],
'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)
print("Original DataFrame :")
df.show()
Output:
Add a new column with a default value:
Python3
# Add a new column filled with null
from pyspark.sql.functions import lit
df = df.withColumn("Rewards", lit(None))
df.show()
# Add a new constant column
df = df.withColumn("Bonus Percent", lit(0.25))
df.show()
Output:
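Note that lit(None) creates a column of NullType, which some operations (such as writing to formats that require concrete types) cannot handle. A minimal sketch of a common workaround, casting the null to a concrete type (string is an assumed choice here):
Python3
from pyspark.sql.functions import lit
# lit(None) alone yields a NullType column; casting it
# pins the schema to a concrete type
df = df.withColumn("Rewards", lit(None).cast("string"))
df.printSchema()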

Method 2: Using pyspark.sql.DataFrame.select(*cols)
We can use pyspark.sql.DataFrame.select() to create a new column in a DataFrame and set it to a default value. It projects a set of expressions and returns a new DataFrame.
Syntax: pyspark.sql.DataFrame.select(*cols)
Parameters: This method accepts the following parameter.
- cols: It contains column names (string) or expressions (Column).
Returns: DataFrame
First, create a simple DataFrame.
Python3
import findspark
findspark.init()
# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession
# creating the session
spark = SparkSession.builder.getOrCreate()
# creating the dataframe
pandas_df = pd.DataFrame({
'Name': ['Anurag', 'Manjeet', 'Shubham',
'Saurabh', 'Ujjawal'],
'Address': ['Patna', 'Delhi', 'Coimbatore',
'Greater noida', 'Patna'],
'ID': [20123, 20124, 20145, 20146, 20147],
'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)
print("Original DataFrame :")
df.show()
Output:
Add a new column with a default value:
Python3
# Add a new column filled with null
from pyspark.sql.functions import lit
df = df.select('*', lit(None).alias("Rewards"))
# Add a new constant column
df = df.select('*', lit(0.25).alias("Bonus Percent"))
df.show()
Output:
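Because select() projects a list of expressions, both default columns can also be added in a single projection instead of two separate calls. A minimal sketch using the same column names as above (the string cast on the null column is an assumption):
Python3
from pyspark.sql.functions import lit
# Add both default columns in one pass over the DataFrame
df = df.select('*',
               lit(None).cast("string").alias("Rewards"),
               lit(0.25).alias("Bonus Percent"))
df.show()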

Method 3: Using pyspark.sql.SparkSession.sql(sqlQuery)
We can use pyspark.sql.SparkSession.sql() to create a new column in a DataFrame and set it to a default value. It returns a DataFrame representing the result of the given query.
Syntax: pyspark.sql.SparkSession.sql(sqlQuery)
Parameters: This method accepts the following parameter.
- sqlQuery: It is a string and contains the SQL query to execute.
Returns: DataFrame
First, create a simple DataFrame:
Python3
import findspark
findspark.init()
# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession
# creating the session
spark = SparkSession.builder.getOrCreate()
# creating the dataframe
pandas_df = pd.DataFrame({
'Name': ['Anurag', 'Manjeet', 'Shubham',
'Saurabh', 'Ujjawal'],
'Address': ['Patna', 'Delhi', 'Coimbatore',
'Greater noida', 'Patna'],
'ID': [20123, 20124, 20145, 20146, 20147],
'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)
print("Original DataFrame :")
df.show()
Output:
Add a new column with a default value:
Python3
# Add columns to the DataFrame using SQL
df.createOrReplaceTempView("GFG_Table")
# Add a new column filled with null
df = spark.sql("select *, null as Rewards from GFG_Table")
# Add a new constant column
df.createOrReplaceTempView("GFG_Table")
df = spark.sql("select *, 0.25 as Bonus_Percent from GFG_Table")
df.show()
Output:
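In SQL, the type of each default can be made explicit with CAST; a bare NULL otherwise has no concrete type. A minimal sketch, where original_df is a hypothetical name for the DataFrame as it stood before the columns above were added:
Python3
# original_df is assumed to be the DataFrame before the
# Rewards / Bonus_Percent columns were added (hypothetical name)
original_df.createOrReplaceTempView("GFG_Original")
df = spark.sql("""select *,
                  cast(null as string) as Rewards,
                  cast(0.25 as double) as Bonus_Percent
                  from GFG_Original""")
df.printSchema()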