Get specific row from PySpark dataframe
Last Updated: 18 Jul, 2021
In this article, we will discuss how to get a specific row from a PySpark dataframe.
Creating Dataframe for demonstration:
Python3
# import the SparkSession class
import pyspark
from pyspark.sql import SparkSession

# create a SparkSession with an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# sample employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# column names for the dataframe
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# create the dataframe and display it
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output:
+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
|          3|        bobby|   company 3|
|          4|       rohith|   company 2|
|          5|      gnanesh|   company 1|
+-----------+-------------+------------+
Method 1: Using collect()
collect() returns all rows of the dataframe as a list of Row objects, which can then be indexed like an ordinary Python list.
Syntax: dataframe.collect()[index_position]
Where,
- dataframe is the pyspark dataframe
- index_position is the index of the row to retrieve (negative indices count from the end)
Example: Python code to access rows
Python3
# retrieve rows by index position
print(dataframe.collect()[0])
print(dataframe.collect()[1])

# a negative index counts from the end of the dataframe
print(dataframe.collect()[-1])
print(dataframe.collect()[2])
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')
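Each element returned by collect() is a Row object, so a field can be read from the retrieved row by column name. A minimal sketch, assuming the dataframe created above:
Python3
# grab the first row, then read one field by column name
row = dataframe.collect()[0]
print(row['Employee NAME'])  # sravan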
Method 2: Using show()
This function displays the top n rows of the dataframe in a formatted table. It prints to the console and returns None, so there is no need to wrap it in print().
Syntax: dataframe.show(no_of_rows)
where, no_of_rows is the number of rows to display
Example: Python code to get the data using show() function
Python3
# display the first two rows
dataframe.show(2)

# display the first row
dataframe.show(1)

# display the whole dataframe (up to 20 rows by default)
dataframe.show()
Output:
+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
+-----------+-------------+------------+
only showing top 2 rows
The remaining calls print similar tables with the first row and with all five rows.
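Note that show() also accepts a truncate parameter (True by default, which shortens long cell values); passing truncate=False prints the full values. A quick sketch:
Python3
# display the first two rows without truncating long values
dataframe.show(2, truncate=False)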
Method 3: Using first()
This function returns the first row of the dataframe as a Row object.
Syntax: dataframe.first()
Example: Python code to select the first row in the dataframe.
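Python3
# first() returns a single Row object
print(dataframe.first())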
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Method 4: Using head()
This method returns the top n rows of the dataframe as a list of Row objects.
Syntax: dataframe.head(n)
where, n is the number of rows to be returned
Example: Python code to get the first n rows using head().
Python3
print(dataframe.head(1))
print(dataframe.head(3))
print(dataframe.head(2))
Output:
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]
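Note that calling head() with no argument is a special case: it returns a single Row object (like first()) rather than a one-element list. A quick sketch:
Python3
# head() with no argument returns a single Row, not a list
print(dataframe.head())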
Method 5: Using tail()
This method returns the last n rows of the dataframe as a list of Row objects. Because those rows are moved into the driver process, tail() should only be used with a small n.
Syntax: dataframe.tail(n)
where, n is the number of rows to be returned from the end of the dataframe.
Example: Python code to get last n rows
Python3
print(dataframe.tail(1))
print(dataframe.tail(3))
print(dataframe.tail(2))
Output:
[Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
[Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
[Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
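Since tail() always returns a list, even for a single row, indexing into the result gives the last row as a Row object. A minimal sketch:
Python3
# take the single element out of the one-row list
last_row = dataframe.tail(1)[0]
print(last_row)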
Method 6: Using select() with collect() method
This method selects particular columns with select() and then picks out a single row from the result with collect().
Syntax: dataframe.select([columns]).collect()[index]
where,
- dataframe is the pyspark dataframe
- columns is the list of columns to be included in each row
- index is the index of the row to be displayed
Example: Python code to select the particular row.
Python3
print(dataframe.select(['Employee ID', 'Employee NAME',
                        'Company Name']).collect()[0])
print(dataframe.select(['Employee ID', 'Employee NAME',
                        'Company Name']).collect()[2])
print(dataframe.select(['Employee ID', 'Employee NAME',
                        'Company Name']).collect()[3])
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')
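The same pattern works with any subset of columns, in which case the returned Row contains only the selected fields. A minimal sketch:
Python3
# collect only the employee names, then index the second row
print(dataframe.select('Employee NAME').collect()[1])  # Row(Employee NAME='ojaswi')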
Method 7: Using take() method
This method returns the top n rows of the dataframe as a list of Row objects, much like head().
Syntax: dataframe.take(n)
where, n is the number of rows to be returned
Example: Python code to get the first n rows using take().
Python3
print(dataframe.take(2))
print(dataframe.take(4))
print(dataframe.take(1))
Output:
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]
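As a closing note, head(n) is implemented in terms of take(n), so the two return identical lists and can be used interchangeably here. A quick check:
Python3
# head(n) delegates to take(n), so the results match
print(dataframe.take(2) == dataframe.head(2))  # True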