Filtering rows based on column values in PySpark dataframe

Last Updated : 29 Jun, 2021

In this article, we are going to filter the rows of a PySpark dataframe based on column values.

Creating a dataframe for demonstration:

Python3

# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

dataframe.show()

Output: all six rows of the dataframe, with columns ID, NAME and Company.

Method 1: Using where()

This function keeps only the rows for which the given condition evaluates to true. In PySpark, where() is an alias for filter(), so the two methods are interchangeable.

Syntax: dataframe.where(condition)

We filter the rows by building a boolean column expression from the column values and passing it as the condition.

Example 1: filter rows in the dataframe where ID = 1

Python3

# get the data where ID is '1'
dataframe.where(dataframe.ID == '1').show()

Output: the two duplicate 'sravan' rows.

Example 2: filter rows where NAME is not 'sravan'

Python3

# get the data where NAME is not 'sravan'
dataframe.where(dataframe.NAME != 'sravan').show()

Output: the four rows whose NAME is not 'sravan'.

Example 3: filtering on multiple column values with a where clause. Each condition must be wrapped in parentheses and combined with & (and) or | (or). Python program to filter rows where ID is greater than 2 and Company is 'company 1':

Python3

# filter rows where ID is greater than '2'
# and Company is 'company 1'
dataframe.where((dataframe.ID > '2') & (dataframe.Company == 'company 1')).show()

Output: the two 'sridevi' rows.

Method 2: Using filter()

This function also keeps only the rows for which the given condition evaluates to true; it behaves identically to where().

Syntax: dataframe.filter(condition)

Example 1: Python code to get the rows where Company is 'company 2'

Python3

# get the data where Company is 'company 2'
dataframe.filter(dataframe.Company == 'company 2').show()

Output: the single 'rohith' row.

Example 2: filter the data where ID > 3.

Python3

# get the data where ID is greater than '3'
dataframe.filter(dataframe.ID > '3').show()

Output: the two rows with ID '4'.

Example 3: filtering on multiple column values. Python program to filter rows where ID is greater than 2 and Company is 'company 2':

Python3

# filter rows where ID is greater than '2'
# and Company is 'company 2'
dataframe.filter((dataframe.ID > '2') & (dataframe.Company == 'company 2')).show()

Output: the single 'rohith' row.
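Besides attribute access (dataframe.ID), the same conditions can be written with the standard col() function from pyspark.sql.functions, or as SQL expression strings, both of which filter() and where() accept. The snippet below is a minimal sketch that reuses the dataframe built above and reproduces the filter from Method 1, Example 3 in both styles.

Python3

# importing the col() helper for building column expressions
from pyspark.sql.functions import col

# same filter as Method 1, Example 3, written with col()
dataframe.where((col('ID') > '2') & (col('Company') == 'company 1')).show()

# filter() and where() also accept a SQL expression string
dataframe.filter("ID > '2' AND Company = 'company 1'").show()

Both calls return the same two 'sridevi' rows as the attribute-access version; which style to use is mostly a matter of readability.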
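One caveat worth noting: ID was created as a string column, so comparisons such as dataframe.ID > '2' are lexicographic string comparisons, which only happen to behave like numeric comparisons here because every ID is a single digit. A minimal sketch of making the comparison genuinely numeric with the standard cast() method:

Python3

from pyspark.sql.functions import col

# cast the string ID column to an integer before comparing numerically
dataframe.filter(col('ID').cast('int') > 2).show()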