How to take a random row from a PySpark DataFrame?
Last Updated: 30 Jan, 2022
In this article, we are going to learn how to take a random row from a PySpark DataFrame in the Python programming language.
Method 1: PySpark sample() method
PySpark provides several sampling methods that return a random subset of rows from a given PySpark DataFrame.
Here are the details of the sample() method:
Syntax: DataFrame.sample(withReplacement, fraction, seed)
It returns a subset of the DataFrame.
Parameters:
withReplacement : bool, optional
Sample with replacement or not (default False).
fraction : float, optional
Fraction of rows to generate, in the range [0.0, 1.0].
seed : int, optional
Used to reproduce the same random sampling.
Example:
In this example, we pass a fraction of float data type from the range [0.0, 1.0]. Using the formula:
Expected number of rows = fraction * total number of rows
So to aim for a single row, the fraction we need is 1 / total number of rows; for the 4-row DataFrame below, that is 1/4 = 0.25. Note that sample() makes an independent random decision for each row, so the number of rows actually returned is approximate and can vary between runs.
Python
import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session').getOrCreate()

data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']

df = random_row_session.createDataFrame(data, columns)
df.show()

# Sample roughly one row: fraction = 1 / total number of rows
df2 = df.sample(False, 1.0 / df.count())
df2.show()
Output:
+-------+--------+
|Letters|Position|
+-------+--------+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-------+--------+
+-------+--------+
|Letters|Position|
+-------+--------+
| b| 2|
+-------+--------+
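Because sample() decides for each row independently whether to include it, the run above can just as well return zero rows or more than one. If exactly one random row is required, a common alternative (a sketch that goes beyond the article's example; it reuses the df created above, and seed=42 is an arbitrary value chosen for reproducibility) is to sort by a random column and keep the first row:
Python
from pyspark.sql.functions import rand

# Shuffle rows by a random key and keep the first one.
# Caution: this sorts the whole DataFrame, which can be costly on large data.
one_random_row = df.orderBy(rand(seed=42)).limit(1)
one_random_row.show()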
Method 2: Using takeSample() method
We first convert the PySpark DataFrame to an RDD. A Resilient Distributed Dataset (RDD) is the simplest and most fundamental data structure in PySpark: an immutable, distributed collection of elements of any data type.
We can get the RDD of a DataFrame using DataFrame.rdd and then use the takeSample() method.
Syntax of takeSample():
takeSample(withReplacement, num, seed=None)
Parameters:
withReplacement : bool, optional
Sample with replacement or not (default False).
num : int
The number of elements to sample.
seed : int, optional
Used to reproduce the same random sampling.
Returns: a list of num elements sampled from the RDD (they are brought to the driver, so num should be small).
Example: In this example, we call the takeSample() method on the RDD with num = 1 to get a single Row object back in a list.
Python
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Create (or reuse) a SparkSession
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session').getOrCreate()

data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']

df = random_row_session.createDataFrame(data, columns)
df.show()

# Convert the DataFrame to an RDD and draw exactly one element
rdd = df.rdd
rdd_sample = rdd.takeSample(withReplacement=False, num=1)
print(rdd_sample)
Output:
+-------+--------+
|Letters|Position|
+-------+--------+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-------+--------+
[Row(Letters='c', Position=3)]
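Unlike sample(), takeSample() returns a plain Python list of Row objects on the driver rather than a DataFrame. As a small follow-up sketch (reusing rdd_sample from the code above; the printed values depend on the random draw), the fields of the sampled Row can be read by name or turned into a dictionary:
Python
row = rdd_sample[0]
print(row['Letters'])  # access a field by column name
print(row.asDict())    # e.g. {'Letters': 'c', 'Position': 3}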
Method 3: Convert the PySpark DataFrame to a Pandas DataFrame and use the sample() method
We can use the toPandas() method to convert a PySpark DataFrame to a Pandas DataFrame. This method should only be used if the resulting Pandas DataFrame is expected to be small, as all the data is loaded into the driver's memory.
We will then use the sample() method of the Pandas library. It returns a random sample from an axis of the Pandas DataFrame.
Syntax: PandasDataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
Example:
In this example, we will be converting our PySpark DataFrame to a Pandas DataFrame and using the Pandas sample() function on it.
Python
import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session').getOrCreate()

data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']

df = random_row_session.createDataFrame(data, columns)
df.show()

# Convert to Pandas, sample one row, then convert back to PySpark
pandas_random = df.toPandas().sample()
df_random = random_row_session.createDataFrame(pandas_random)
df_random.show()
Output:
+-------+--------+
|Letters|Position|
+-------+--------+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-------+--------+
+-------+--------+
|Letters|Position|
+-------+--------+
| b| 2|
+-------+--------+
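Pandas' sample() also accepts n (an exact number of rows) and random_state (a seed), so the sampled row count is exact and reproducible, which sample() on a PySpark DataFrame does not guarantee. A short sketch reusing the df from above (n=2 and random_state=42 are arbitrary illustrative values):
Python
# Draw exactly two rows, reproducibly, from the Pandas copy
pandas_two = df.toPandas().sample(n=2, random_state=42)
print(pandas_two)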