Python PySpark - Drop columns based on column names or String condition
In this article, we will take a step-by-step approach to dropping columns from a PySpark DataFrame based on column names or a string condition.
Stepwise Implementation
Step 1: Create CSV
In this step, we simply create a CSV file with three rows and three columns.
CSV Used:

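The CSV image is not reproduced here, but a file of the same shape can be generated with Python's standard library. The column names and values below are assumptions for illustration only; the original article implies just that one column is named Gender (it is dropped in Step 5):

```python
import csv

# Hypothetical data matching the article's shape: three rows, three columns.
# Names and values are assumptions; only the "Gender" column is implied by Step 5.
rows = [
    {"Name": "Alice", "Age": 23, "Gender": "F"},
    {"Name": "Bob", "Age": 27, "Gender": "M"},
    {"Name": "Cara", "Age": 25, "Gender": "F"},
]

# Write book1.csv, the file name read in Step 4.
with open("book1.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Name", "Age", "Gender"])
    writer.writeheader()
    writer.writerows(rows)
```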
Step 2: Import PySpark Library
In this step, we import the PySpark package so that we can use its functionality, using the below syntax:
import pyspark
Step 3: Start a SparkSession
In this step, we start our Spark session by chaining SparkSession.builder.appName() with getOrCreate().
from pyspark.sql import SparkSession
# The appName can be any string you like
spark = SparkSession.builder.appName('GeeksForGeeks').getOrCreate()
print(spark)
Output:

Step 4: Read our CSV
To read our CSV we use spark.read.csv(), passing it two options:
- header = True [Treats the first row of the CSV as the column names]
- inferSchema = True [Infers the right data type for each column instead of reading everything as strings]
df = spark.read.csv('book1.csv', header=True, inferSchema=True)
df.show()
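To make header=True concrete: it tells Spark to treat the first CSV row as column names rather than data, much like the standard library's csv.DictReader does. A small sketch with hypothetical values:

```python
import csv
import io

# Hypothetical CSV content; the first row supplies the column names.
raw = "Name,Age,Gender\nAlice,23,F\nBob,27,M\n"

reader = csv.DictReader(io.StringIO(raw))
fieldnames = reader.fieldnames   # column names taken from the first row
records = list(reader)           # remaining rows become data records

# Note: DictReader leaves every value as a string ("23", not 23).
# inferSchema=True is what makes Spark convert such columns to real
# numeric types instead.
```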
Output:

Step 5: Drop Column based on Column Name
Finally, we can see how simple it is to drop a column based on its name.
To drop a column we use DataFrame.drop(). Looking at the result, we can see that the Gender column is no longer part of the DataFrame.
df = df.drop("Gender")
df.show()
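The title also mentions dropping columns based on a string condition, while the step above matches an exact name. One way to sketch that case (an assumption, not shown in the original) is to filter df.columns with a substring test and pass the matches to drop(); the selection logic itself is plain Python:

```python
# Hypothetical column names; in PySpark these would come from df.columns.
columns = ["Name", "Age", "Gender"]

# Keep only the columns whose name contains the substring "Gen".
to_drop = [c for c in columns if "Gen" in c]

# With a live SparkSession this would be:
#   df = df.drop(*to_drop)
# which drops every matching column in a single call.
```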
