Filter PySpark DataFrame Using isin by Exclusion
Python is an object-oriented, dynamically typed, high-level, interpreted programming language. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it particularly appealing for rapid application development, as well as for use as a scripting or glue language to bring existing components together.
PySpark Dataframe
Data is organized into named columns in PySpark dataframes which are distributed collections of data that can be run on different computers. These dataframes may draw from existing resilient distributed datasets (RDDs), external databases, or structured data files.
Syntax - isin()
isin(list_)
The list_ parameter takes a list of values to compare against the column.
To check or filter whether DataFrame values are present in a list of values, use PySpark's isin() function or the IN operator.
The Column class's isin() function returns a boolean Column: True wherever the column's value is contained in the given list of values.
Example 1
# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder.appName('isin_exclusion').getOrCreate()

# Create a PySpark DataFrame
data = [("Karthik", 25), ("Vijay", 30), ("Kruthik", 35), ("Sricharan", 40), ("Aditi", 45)]
df = spark.createDataFrame(data, ["name", "age"])

# Define the list of values to exclude
exclusion_list = [25, 30]

# Filter out rows where age is in the exclusion list
filtered_df = df.filter(~col("age").isin(exclusion_list))

# Display the filtered DataFrame
filtered_df.show()
Output
+---------+---+
|     name|age|
+---------+---+
|  Kruthik| 35|
|Sricharan| 40|
|    Aditi| 45|
+---------+---+
Example 2
# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder.appName('isin_exclusion').getOrCreate()

# Create a PySpark DataFrame
data = [("Karthik", "New York"), ("Vijay", "Chicago"), ("Kruthik", "San Francisco"), ("Sricharan", "Los Angeles"), ("Aditi", "Miami")]
df = spark.createDataFrame(data, ["name", "city"])

# Define the list of values to exclude
exclusion_list = ["New York", "Chicago", "Miami"]

# Filter out rows where city is in the exclusion list
filtered_df = df.filter(~col("city").isin(exclusion_list))

# Display the filtered DataFrame
filtered_df.show()
Output
+---------+-------------+
|     name|         city|
+---------+-------------+
|  Kruthik|San Francisco|
|Sricharan|  Los Angeles|
+---------+-------------+
Creating Dataframe for Demonstration
Example 1
# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder.appName('combined_filter').getOrCreate()

# Create a PySpark DataFrame
data = [("Alice", "New York"), ("Bob", "Chicago"), ("Charlie", "San Francisco"), ("David", "Los Angeles"), ("Eva", "Miami")]
df = spark.createDataFrame(data, ["name", "city"])

# Define the list of values to exclude
exclusion_list = ["New York", "Chicago", "Miami"]

# Filter out rows where city is in the exclusion list and name is not "David"
filtered_df = df.filter(~(col("city").isin(exclusion_list) & (col("name") != "David")))

# Display the filtered DataFrame
filtered_df.show()
Output
+-------+-------------+
|   name|         city|
+-------+-------------+
|Charlie|San Francisco|
|  David|  Los Angeles|
+-------+-------------+
Example 2
# Importing the module
import pyspark

# Importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# Creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# List of students' data
data = [[1, "Karthik", "Sharma"],
   [2, "Kruthik", "Ballari"],
   [3, "Vijay", "Kumar"],
   [4, "Aditi", "Gupta"],
   [5, "Sricharan", "Sharma"]]

# Specify column names
columns = ['ID', 'NAME', 'Lastname']

# Creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output
+---+---------+--------+
| ID|     NAME|Lastname|
+---+---------+--------+
|  1|  Karthik|  Sharma|
|  2|  Kruthik| Ballari|
|  3|    Vijay|   Kumar|
|  4|    Aditi|   Gupta|
|  5|Sricharan|  Sharma|
+---+---------+--------+
Example 3
The following code fetches the rows whose NAME column matches a given value and displays them.
filter(): This clause checks a condition and returns the matching rows. filter() and where() are aliases of each other and behave identically.
Syntax
dataframe.filter(condition)
# Getting the row for Kruthik
dataframe.filter((dataframe.NAME).isin(['Kruthik'])).show()
Output
+---+-------+--------+
| ID|   NAME|Lastname|
+---+-------+--------+
|  2|Kruthik| Ballari|
+---+-------+--------+
Example 4
The following program fetches the rows whose last name is Sharma and prints them.
where(): This clause checks a condition and returns the matching rows; it is an alias of filter().
Syntax
dataframe.where(condition)
# Fetching rows of people whose last name is Sharma
dataframe.where((dataframe.Lastname).isin(['Sharma'])).show()
Output
+---+---------+--------+
| ID|     NAME|Lastname|
+---+---------+--------+
|  1|  Karthik|  Sharma|
|  5|Sricharan|  Sharma|
+---+---------+--------+
Conclusion
Filtering with isin() by exclusion, by negating it with the ~ operator, is a very helpful way to filter out rows from a PySpark DataFrame where a column value matches a predefined list of values. It combines cleanly with other conditions and can be applied in many circumstances.
When working with enormous datasets, this strategy is especially helpful because it can drastically minimize the amount of data that needs to be processed. Data scientists and analysts can quickly filter out irrelevant data and concentrate exclusively on the information they require for their study by employing isin by exclusion.