Add new column with default value in PySpark dataframe
Last Updated: 29 Jun, 2021
In this article, we are going to see how to add a new column with a default value in PySpark Dataframe.
There are three ways to add a column with a default value to a PySpark DataFrame:
- Using pyspark.sql.DataFrame.withColumn(colName, col)
- Using pyspark.sql.DataFrame.select(*cols)
- Using pyspark.sql.SparkSession.sql(sqlQuery)
Method 1: Using pyspark.sql.DataFrame.withColumn(colName, col)
This method adds a column, or replaces an existing column of the same name, and returns a new DataFrame containing all the existing columns plus the new one. The column expression must be an expression over this DataFrame; adding a column that belongs to some other DataFrame will raise an error, as the sketch below the parameter list illustrates.
Syntax: pyspark.sql.DataFrame.withColumn(colName, col)
Parameters:
- colName: a string with the name of the new column.
- col: a Column expression for the new column.
Returns: DataFrame
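As a quick illustration of that restriction, here is a minimal sketch (the toy DataFrames df_a and df_b below are hypothetical, not part of this article's example data):
Python3
# Minimal sketch: a column expression must belong to the DataFrame
# being extended; borrowing a column from another DataFrame fails
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_a = spark.createDataFrame([(1,)], ["x"])
df_b = spark.createDataFrame([(2,)], ["y"])

try:
    df_a.withColumn("z", df_b["y"]).show()
except Exception as err:
    # Spark rejects the unresolved cross-DataFrame reference,
    # typically with an AnalysisException
    print(type(err).__name__)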
First, create a simple DataFrame.
Python3
import findspark
findspark.init()

# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession

# creating the session
spark = SparkSession.builder.getOrCreate()

# creating the dataframe
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)

print("Original DataFrame:")
df.show()
Output:
Add a new column with a default value:
Python3
# Add a new column filled with NULL
from pyspark.sql.functions import lit

df = df.withColumn("Rewards", lit(None))
df.show()

# Add a new constant column
df = df.withColumn("Bonus Percent", lit(0.25))
df.show()
Output:
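One caveat worth noting: lit(None) gives the new column the special NullType, which some sinks (Parquet, for instance) cannot write. If a concrete type is needed, casting the literal is a common workaround; a minimal sketch:
Python3
# Sketch: cast the NULL default so the schema records a concrete
# string type instead of NullType
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

df = df.withColumn("Rewards", lit(None).cast(StringType()))
df.printSchema()  # Rewards appears as string (nullable = true)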

Method 2: Using pyspark.sql.DataFrame.select(*cols)
We can use pyspark.sql.DataFrame.select() to create a new column in a DataFrame and set it to a default value. It projects a set of expressions and returns a new DataFrame.
Syntax: pyspark.sql.DataFrame.select(*cols)
Parameters:
- cols: column names (as strings) or expressions (as Column objects).
Returns: DataFrame
First, create a simple DataFrame.
Python3
import findspark
findspark.init()

# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession

# creating the session
spark = SparkSession.builder.getOrCreate()

# creating the dataframe
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)

print("Original DataFrame:")
df.show()
Output:
Add a new column with a default value:
Python3
# Add a new column filled with NULL
from pyspark.sql.functions import lit

df = df.select('*', lit(None).alias("Rewards"))

# Add a new constant column
df = df.select('*', lit(0.25).alias("Bonus Percent"))
df.show()
Output:
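Since select() projects every expression in one pass, both defaults can also be added with a single call instead of two; a minimal sketch, starting again from the original DataFrame:
Python3
# Sketch: add both default columns in one select() projection
from pyspark.sql.functions import lit

df = df.select('*',
               lit(None).alias("Rewards"),
               lit(0.25).alias("Bonus Percent"))
df.show()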

Method 3: Using pyspark.sql.SparkSession.sql(sqlQuery)
We can use pyspark.sql.SparkSession.sql() to create a new column in a DataFrame and set it to a default value. It returns a DataFrame representing the result of the given query.
Syntax: pyspark.sql.SparkSession.sql(sqlQuery)
Parameters:
- sqlQuery: a string containing the SQL query to execute.
Returns: DataFrame
First, create a simple DataFrame:
Python3
import findspark
findspark.init()

# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession

# creating the session
spark = SparkSession.builder.getOrCreate()

# creating the dataframe
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)

print("Original DataFrame:")
df.show()
Output:
Add a new column with a default value:
Python3
# Add columns to the DataFrame using SQL
df.createOrReplaceTempView("GFG_Table")

# Add a new column filled with NULL
df = spark.sql("select *, null as Rewards from GFG_Table")

# Add a new constant column (0.25 is left unquoted so the column
# is numeric; quoting it as '0.25' would make it a string)
df.createOrReplaceTempView("GFG_Table")
df = spark.sql("select *, 0.25 as Bonus_Percent from GFG_Table")
df.show()
Output:
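As in the withColumn() variant, a SQL CAST can give the NULL default a concrete type; a minimal sketch against the same temp view (the column name Rewards_Typed is hypothetical):
Python3
# Sketch: CAST makes the NULL default a string column rather than
# leaving it untyped
df.createOrReplaceTempView("GFG_Table")
df = spark.sql(
    "select *, cast(null as string) as Rewards_Typed from GFG_Table")
df.printSchema()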