How to Add Multiple Columns to PySpark DataFrames?
In this article, we will see different ways of adding multiple columns to PySpark DataFrames.
Let's create a sample DataFrame for demonstration:
Dataset Used: Cricket_data_set_odi
# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# create a DataFrame from the CSV file, treating the first row as a header
df = spark.read.option("header", True).csv("Cricket_data_set_odi.csv")

# display the schema
df.printSchema()

# show the DataFrame
df.show()
Output:

Method 1: Using withColumn()
withColumn() is used to add a new column or update an existing column on a DataFrame.
Syntax: df.withColumn(colName, col)
Returns: a new DataFrame with the column added, or the existing column of the same name replaced.
Code:
df.withColumn('Avg_runs', df.Runs / df.Matches) \
  .withColumn('wkt+10', df.Wickets + 10).show()
Output:

Method 2: Using select()
You can also add multiple columns using select().
Syntax: df.select(*cols)
Code:
# using select() to add multiple columns
df.select('*', (df.Runs / df.Matches).alias('Avg_runs'),
          (df.Wickets + 10).alias('wkt+10')).show()
Output:

Method 3: Adding Multiple Constant Columns to a DataFrame Using withColumn() and select()
Let's create new columns with constant values using the lit() SQL function in the code below. The lit() function in PySpark adds a new column to a DataFrame by assigning it a constant or literal value.
from pyspark.sql.functions import col, lit
df.select('*', lit("Cricket").alias("Sport")) \
  .withColumn("Fitness", lit("Good")).show()
Output:
