Get specific row from PySpark dataframe
Last Updated: 18 Jul, 2021
In this article, we will discuss how to get a specific row from a PySpark dataframe.
Creating Dataframe for demonstration:
Python3
# import the SparkSession class
import pyspark
from pyspark.sql import SparkSession

# create a SparkSession with an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# sample employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# column names for the dataframe
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# create the dataframe and display it
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output:
+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
|          3|        bobby|   company 3|
|          4|       rohith|   company 2|
|          5|      gnanesh|   company 1|
+-----------+-------------+------------+
Method 1: Using collect()
collect() returns all rows of the dataframe as a list of Row objects, which can then be indexed like an ordinary Python list.
Syntax: dataframe.collect()[index_position]
Where,
- dataframe is the pyspark dataframe
- index_position is the index of the row to retrieve (negative indices count from the end)
Example: Python code to access rows
Python3
# retrieve rows by index position
print(dataframe.collect()[0])
print(dataframe.collect()[1])

# a negative index counts from the end of the dataframe
print(dataframe.collect()[-1])
print(dataframe.collect()[2])
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')
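Each element returned by collect() is a Row object, so a field can be read from the retrieved row by column name. A minimal sketch, assuming the dataframe created above:
Python3
# grab the first row, then read one field by column name
row = dataframe.collect()[0]
print(row['Employee NAME'])  # sravan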
Method 2: Using show()
This function displays the top n rows of the dataframe in a formatted table. It prints to the console and returns None, so there is no need to wrap it in print().
Syntax: dataframe.show(no_of_rows)
where, no_of_rows is the number of rows to display
Example: Python code to get the data using show() function
Python3
# display the first two rows
dataframe.show(2)

# display the first row
dataframe.show(1)

# display the whole dataframe (up to 20 rows by default)
dataframe.show()
Output:
+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
+-----------+-------------+------------+
only showing top 2 rows
The remaining calls print similar tables with the first row and with all five rows.
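Note that show() also accepts a truncate parameter (True by default, which shortens long cell values); passing truncate=False prints the full values. A quick sketch:
Python3
# display the first two rows without truncating long values
dataframe.show(2, truncate=False)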
Method 3: Using first()
This function returns the first row of the dataframe as a Row object.
Syntax: dataframe.first()
Example: Python code to select the first row in the dataframe.
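Python3
# first() returns a single Row object
print(dataframe.first())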
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Method 4: Using head()
This method returns the top n rows of the dataframe as a list of Row objects.
Syntax: dataframe.head(n)
where, n is the number of rows to be returned
Example: Python code to get the first n rows using head().
Python3
print(dataframe.head(1))
print(dataframe.head(3))
print(dataframe.head(2))
Output:
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]
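Note that calling head() with no argument is a special case: it returns a single Row object (like first()) rather than a one-element list. A quick sketch:
Python3
# head() with no argument returns a single Row, not a list
print(dataframe.head())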
Method 5: Using tail()
This method returns the last n rows of the dataframe as a list of Row objects. Because those rows are moved into the driver process, tail() should only be used with a small n.
Syntax: dataframe.tail(n)
where, n is the number of rows to be returned from the end of the dataframe.
Example: Python code to get last n rows
Python3
print(dataframe.tail(1))
print(dataframe.tail(3))
print(dataframe.tail(2))
Output:
[Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
[Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
[Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
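Since tail() always returns a list, even for a single row, indexing into the result gives the last row as a Row object. A minimal sketch:
Python3
# take the single element out of the one-row list
last_row = dataframe.tail(1)[0]
print(last_row)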
Method 6: Using select() with collect() method
This method selects particular columns with select() and then picks out a single row from the result with collect().
Syntax: dataframe.select([columns]).collect()[index]
where,
- dataframe is the pyspark dataframe
- columns is the list of columns to be included in each row
- index is the index of the row to be displayed
Example: Python code to select the particular row.
Python3
print(dataframe.select(['Employee ID', 'Employee NAME',
                        'Company Name']).collect()[0])
print(dataframe.select(['Employee ID', 'Employee NAME',
                        'Company Name']).collect()[2])
print(dataframe.select(['Employee ID', 'Employee NAME',
                        'Company Name']).collect()[3])
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')
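The same pattern works with any subset of columns, in which case the returned Row contains only the selected fields. A minimal sketch:
Python3
# collect only the employee names, then index the second row
print(dataframe.select('Employee NAME').collect()[1])  # Row(Employee NAME='ojaswi')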
Method 7: Using take() method
This method returns the top n rows of the dataframe as a list of Row objects, much like head().
Syntax: dataframe.take(n)
where, n is the number of rows to be returned
Example: Python code to get the first n rows using take().
Python3
print(dataframe.take(2))
print(dataframe.take(4))
print(dataframe.take(1))
Output:
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]
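As a closing note, head(n) is implemented in terms of take(n), so the two return identical lists and can be used interchangeably here. A quick check:
Python3
# head(n) delegates to take(n), so the results match
print(dataframe.take(2) == dataframe.head(2))  # True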