# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession
# import data field types
from pyspark.sql.types import (StringType, DoubleType, IntegerType,
                               StructType, StructField, FloatType)
# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of student data
data = [(1, "sravan", 9.8, 4500.00),
        (2, "ojsawi", 9.2, 6789.00),
        (3, "bobby", 8.9, 988.00)]
# specifying the schema: column names with their data types
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("NAME", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("FEE", DoubleType(), True),
])
# creating a DataFrame from the data and the schema
dataframe = spark.createDataFrame(data, schema)
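# (illustrative addition, not in the original) printing the schema and the
# rows makes the per-type selections below easy to verify; printSchema()
# and show() are standard DataFrame methods
dataframe.printSchema()
dataframe.show()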
# select columns that are integer type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, IntegerType)]].collect())
# select columns that are string type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, StringType)]].collect())
# select columns that are float type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, FloatType)]].collect())
# select columns that are double type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, DoubleType)]].collect())
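# The four selections above all repeat the same pattern: filter the schema's
# fields by type, then select the matching column names. A small reusable
# helper (hypothetical, not part of the PySpark API) makes that explicit:
def select_columns_by_type(df, dtype):
    # keep only the columns whose dataType is an instance of dtype
    names = [f.name for f in df.schema.fields if isinstance(f.dataType, dtype)]
    return df.select(names)

# usage: same result as the IntegerType selection above
print(select_columns_by_type(dataframe, IntegerType).collect())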