Add new column with default value in PySpark dataframe
Last Updated: 29 Jun, 2021
In this article, we are going to see how to add a new column with a default value in PySpark Dataframe.
There are three ways to add a column with a default value to a PySpark DataFrame:
- Using pyspark.sql.DataFrame.withColumn(colName, col)
- Using pyspark.sql.DataFrame.select(*cols)
- Using pyspark.sql.SparkSession.sql(sqlQuery)
Method 1: Using pyspark.sql.DataFrame.withColumn(colName, col)
This method adds a column, or replaces an existing column of the same name, and returns a new DataFrame containing all the existing columns plus the new one. The column expression must be an expression over this DataFrame; adding a column that belongs to some other DataFrame will raise an error, as the sketch below the parameter list illustrates.
Syntax: pyspark.sql.DataFrame.withColumn(colName, col)
Parameters:
- colName: a string with the name of the new column.
- col: a Column expression for the new column.
Returns: DataFrame
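As a quick illustration of that restriction, here is a minimal sketch (the toy DataFrames df_a and df_b below are hypothetical, not part of this article's example data):
Python3
# Minimal sketch: a column expression must belong to the DataFrame
# being extended; borrowing a column from another DataFrame fails
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_a = spark.createDataFrame([(1,)], ["x"])
df_b = spark.createDataFrame([(2,)], ["y"])

try:
    df_a.withColumn("z", df_b["y"]).show()
except Exception as err:
    # Spark rejects the unresolved cross-DataFrame reference,
    # typically with an AnalysisException
    print(type(err).__name__)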
First, create a simple DataFrame.
Python3
import findspark
findspark.init()

# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession

# creating the session
spark = SparkSession.builder.getOrCreate()

# creating the dataframe
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)

print("Original DataFrame:")
df.show()
Output:
Add a new column with a default value:
Python3
# Add a new column filled with NULL
from pyspark.sql.functions import lit

df = df.withColumn("Rewards", lit(None))
df.show()

# Add a new constant column
df = df.withColumn("Bonus Percent", lit(0.25))
df.show()
Output:
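One caveat worth noting: lit(None) gives the new column the special NullType, which some sinks (Parquet, for instance) cannot write. If a concrete type is needed, casting the literal is a common workaround; a minimal sketch:
Python3
# Sketch: cast the NULL default so the schema records a concrete
# string type instead of NullType
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

df = df.withColumn("Rewards", lit(None).cast(StringType()))
df.printSchema()  # Rewards appears as string (nullable = true)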

Method 2: Using pyspark.sql.DataFrame.select(*cols)
We can use pyspark.sql.DataFrame.select() to create a new column in a DataFrame and set it to a default value. It projects a set of expressions and returns a new DataFrame.
Syntax: pyspark.sql.DataFrame.select(*cols)
Parameters:
- cols: column names (as strings) or expressions (as Column objects).
Returns: DataFrame
First, create a simple DataFrame.
Python3
import findspark
findspark.init()

# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession

# creating the session
spark = SparkSession.builder.getOrCreate()

# creating the dataframe
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)

print("Original DataFrame:")
df.show()
Output:
Add a new column with a default value:
Python3
# Add a new column filled with NULL
from pyspark.sql.functions import lit

df = df.select('*', lit(None).alias("Rewards"))

# Add a new constant column
df = df.select('*', lit(0.25).alias("Bonus Percent"))
df.show()
Output:
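Since select() projects every expression in one pass, both defaults can also be added with a single call instead of two; a minimal sketch, starting again from the original DataFrame:
Python3
# Sketch: add both default columns in one select() projection
from pyspark.sql.functions import lit

df = df.select('*',
               lit(None).alias("Rewards"),
               lit(0.25).alias("Bonus Percent"))
df.show()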

Method 3: Using pyspark.sql.SparkSession.sql(sqlQuery)
We can use pyspark.sql.SparkSession.sql() to create a new column in a DataFrame and set it to a default value. It returns a DataFrame representing the result of the given query.
Syntax: pyspark.sql.SparkSession.sql(sqlQuery)
Parameters:
- sqlQuery: a string containing the SQL query to execute.
Returns: DataFrame
First, create a simple DataFrame:
Python3
import findspark
findspark.init()

# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession

# creating the session
spark = SparkSession.builder.getOrCreate()

# creating the dataframe
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)

print("Original DataFrame:")
df.show()
Output:
Add a new column with a default value:
Python3
# Add columns to the DataFrame using SQL
df.createOrReplaceTempView("GFG_Table")

# Add a new column filled with NULL
df = spark.sql("select *, null as Rewards from GFG_Table")

# Add a new constant column (0.25 is left unquoted so the column
# is numeric; quoting it as '0.25' would make it a string)
df.createOrReplaceTempView("GFG_Table")
df = spark.sql("select *, 0.25 as Bonus_Percent from GFG_Table")
df.show()
Output:
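As in the withColumn() variant, a SQL CAST can give the NULL default a concrete type; a minimal sketch against the same temp view (the column name Rewards_Typed is hypothetical):
Python3
# Sketch: CAST makes the NULL default a string column rather than
# leaving it untyped
df.createOrReplaceTempView("GFG_Table")
df = spark.sql(
    "select *, cast(null as string) as Rewards_Typed from GFG_Table")
df.printSchema()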