How to Extract a Random Sample of Rows from an R DataFrame with a Nested Condition

Last Updated : 24 Jun, 2021

In this article, we will learn how to extract random samples of rows in a DataFrame in R programming language with a nested condition.

Method 1: Using sample()

We will use the sample() function for this task. sample() in R draws a random sample from its first argument, which can be either a vector or a single positive integer k (in which case the sample is drawn from 1:k). By default, sampling is done without replacement.

The other function we will use is which(). It lets us supply the condition that decides which rows are eligible for sampling. which() returns the indices of the elements for which the condition given as its argument is TRUE; elements where the condition is FALSE or NA are skipped.
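As a minimal sketch of how the two functions work together, the snippet below applies them to a plain vector. The vector x simply mirrors the year column of the dataframe used later, and set.seed() is added only so that the draw is repeatable; neither is part of the method itself.

```r
# Values mirror the "year" column used later in the article
x <- c(10, 51, 19, 126, 99)

# which() gives the positions where the condition is TRUE
idx <- which(x > 20)
idx
# [1] 2 4 5

# sample() then draws two of those positions without replacement
set.seed(1)  # assumption: fixed seed, only to make the draw repeatable
picked <- sample(idx, 2)

# The picked positions are used to pull out the matching values
x[picked]
```

Note that sample() receives positions, not values, which is why its result is used as an index.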

Syntax: df[ sample(which ( conditions ) ,n), ]

Parameters:

df: DataFrame
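n: number of rows to sample; it must not exceed the number of rows that satisfy the condition
conditions: logical expression that selects the candidate rows. Ex: df$year > 5. Several tests can be nested with & (and), | (or) and parentheses. Ex: df$year > 5 & (df$education == "yes" | df$length > 50)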

DataFrame in Use:

	name	year	length	education
1	Welcome	10	40	yes
2	to	51	NA	yes
3	Geeks	19	NA	no
4	for	126	100	no
5	Geeks	99	95	yes

To apply this approach, the dataframe is created first, and the row indices that satisfy the condition are passed to sample(). The sampled indices are then used to subset the dataframe. The implementations below use the above dataframe to illustrate this.

Example 1:

df <- data.frame( name = c("Welcome", "to", "Geeks",
                           "for", "Geeks"),
                 
                year = c(10, 51, 19, 126, 99),
                 
                length = c(40, NA, NA, 100, 95),
                 
                education = c("yes", "yes", "no", 
                              "no", "yes") )
df

# Printing 2 rows
print("2 samples")
df[ sample(which (df$year > 5) ,2), ]

Output (the sampled rows will differ from run to run):

     name year length education
1 Welcome   10     40       yes
2      to   51     NA       yes
3   Geeks   19     NA        no
4     for  126    100        no
5   Geeks   99     95       yes
[1] "2 samples"
     name year length education
1 Welcome   10     40       yes
2      to   51     NA       yes

Example 2:

df <- data.frame( name = c("Welcome", "to", "Geeks", 
                           "for", "Geeks"),
                 
                year = c(10, 51, 19, 126, 99),
                 
                length = c(40, NA, NA, 100, 95),
                 
                education = c("yes", "yes", "no", 
                              "no", "yes") )
df

# Printing 3 rows
print("3 samples")
df[ sample(which (df$education !="no") ,3), ]

Output:

     name year length education
1 Welcome   10     40       yes
2      to   51     NA       yes
3   Geeks   19     NA        no
4     for  126    100        no
5   Geeks   99     95       yes
[1] "3 samples"
     name year length education
5   Geeks   99     95       yes
1 Welcome   10     40       yes
2      to   51     NA       yes
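Both examples above test a single column. A nested condition could be sketched as follows, combining tests with & and | inside which(). The helper sample_rows() is hypothetical (it is not part of base R); it is included because sample() treats a single number k as the range 1:k, which would give wrong rows if only one row matched the condition.

```r
df <- data.frame(
  name = c("Welcome", "to", "Geeks", "for", "Geeks"),
  year = c(10, 51, 19, 126, 99),
  length = c(40, NA, NA, 100, 95),
  education = c("yes", "yes", "no", "no", "yes")
)

# Nested condition: year above 15 AND (education is "yes" OR length above 90).
# Row 3 evaluates to NA (its length is missing), and which() drops it.
idx <- which(df$year > 15 & (df$education == "yes" | df$length > 90))

# Hypothetical helper: draw positions within idx rather than passing idx
# straight to sample(), so a single matching row is handled correctly.
sample_rows <- function(idx, n) {
  n <- min(n, length(idx))           # never ask for more rows than matched
  idx[sample.int(length(idx), n)]
}

set.seed(42)  # assumption: fixed seed, only to make the draw repeatable
result <- df[sample_rows(idx, 2), ]
result
```

Here rows 2, 4 and 5 satisfy the condition, so the two sampled rows always come from that set.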

Method 2: Using sample_n() function

sample_n() from the dplyr package selects a random sample of rows from a data frame. Since dplyr 1.0.0 it has been superseded by slice_sample(), but it still works.

Syntax: sample_n(x, n)

Parameters:

x: Data Frame
n: size/number of items to select

Along with sample_n(), we also use the filter() function. filter() keeps the rows for which the filtering expression is TRUE and drops the rest, including rows where the expression evaluates to NA.

Syntax: filter(x, expr)

Parameters:

x: Object to be filtered
expr: expression as a base for filtering
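A short sketch of filter() on its own may help here. It assumes the dplyr package is installed, and it uses a cut-down version of the article's dataframe with only two columns.

```r
library(dplyr)

# Cut-down illustration data: two columns from the article's dataframe
df <- data.frame(
  name = c("Welcome", "to", "Geeks", "for", "Geeks"),
  length = c(40, NA, NA, 100, 95)
)

# Inside filter(), columns can be referenced by their bare names.
# Rows where length is NA do not pass the test and are dropped.
kept <- filter(df, length > 50)
kept
```

Only the rows with length 100 and 95 remain; the two rows with a missing length are removed along with the row whose length is 40.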

We load the dplyr package because it provides both filter() and sample_n(). We pass the sample dataframe df and the nested condition to filter(), and then pipe the filtered rows into sample_n() to draw n of them at random.

Syntax: filter(df, condition) %>% sample_n(., n)

Parameters:

df: Dataframe Object
condition: Nested conditionals, written with the column names. Ex: name != "to" (df$name != "to" also works)
n: Number of samples

Example 1:

library(dplyr)

df <- data.frame( name = c("Welcome", "to", "Geeks",
                           "for", "Geeks"),
                 
                year = c(10, 51, 19, 126, 99),
                 
                length = c(40, NA, NA, 100, 95),
                 
                education = c("yes", "yes", "no",
                              "no", "yes") )
df

# Printing 2 rows
print("2 samples")

filter(df, name != "to") %>% sample_n(2)

Output:

     name year length education
1 Welcome   10     40       yes
2      to   51     NA       yes
3   Geeks   19     NA        no
4     for  126    100        no
5   Geeks   99     95       yes
[1] "2 samples"
     name year length education
1 Welcome   10     40       yes
2   Geeks   99     95       yes

Example 2:

library(dplyr)

df <- data.frame( name = c("Welcome", "to", "Geeks",
                           "for", "Geeks"),
                year = c(10, 51, 19, 126, 99),
                 
                length = c(40, NA, NA, 100, 95),
                 
                education = c("yes", "yes", "no", 
                              "no", "yes") )
df

# Printing 2 rows
print("2 samples")

filter(df, year > 20) %>% sample_n(2)

Output:

     name year length education
1 Welcome   10     40       yes
2      to   51     NA       yes
3   Geeks   19     NA        no
4     for  126    100        no
5   Geeks   99     95       yes
[1] "2 samples"
  name year length education
1  for  126    100        no
2   to   51     NA       yes
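The dplyr examples above also use a single test each. A nested condition could be sketched as below, assuming dplyr 1.0.0 or later is installed. It swaps in slice_sample(), the function that supersedes sample_n(), and the fixed seed is an addition made only so the result is repeatable.

```r
library(dplyr)

df <- data.frame(
  name = c("Welcome", "to", "Geeks", "for", "Geeks"),
  year = c(10, 51, 19, 126, 99),
  length = c(40, NA, NA, 100, 95),
  education = c("yes", "yes", "no", "no", "yes")
)

set.seed(42)  # assumption: fixed seed, only to make the draw repeatable

# Comma-separated arguments to filter() are combined with AND, so this reads:
# year above 15 AND (education is "yes" OR length above 90).
result <- df %>%
  filter(year > 15, education == "yes" | length > 90) %>%
  slice_sample(n = 2)   # successor of sample_n()
result
```

The row with year 19 is excluded because its length is missing, which leaves three candidate rows (years 51, 126 and 99) from which two are drawn.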