PySpark Distinct and Filter
Here's a structured set of notes, with code, covering how to change data types, filter data, and handle unique/distinct values in PySpark using the employee data:
1. Changing Data Types
In PySpark, you can change the data type of a column using the cast() method. This is helpful when you need to convert data types for columns like Salary or Phone.
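For example (a minimal sketch, assuming the employee data is already loaded into a DataFrame named df with Salary and Phone columns):

from pyspark.sql.functions import col

# Cast Salary to a double and Phone to a string
df = df.withColumn("Salary", col("Salary").cast("double")) \
       .withColumn("Phone", col("Phone").cast("string"))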
# Verify the updated schema
df.printSchema()
2. Filtering Data
You can filter rows based on specific conditions. For instance, to filter employees with a
salary greater than 50,000:
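A minimal sketch, assuming the same df with a numeric Salary column:

from pyspark.sql.functions import col

# Keep only rows where Salary exceeds 50,000
high_earners = df.filter(col("Salary") > 50000)
high_earners.show()

.where() is an alias for .filter(), so df.where(col("Salary") > 50000) produces the same result.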
You can also apply multiple conditions using & or | (AND/OR) to filter data. For example,
finding employees over 30 years old and in the IT department:
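For instance (assuming Age and Department columns on df):

from pyspark.sql.functions import col

# Employees older than 30 AND in the IT department
it_over_30 = df.filter((col("Age") > 30) & (col("Department") == "IT"))
it_over_30.show()

Note the parentheses around each condition: & and | bind more tightly than the comparison operators in Python, so leaving them out raises an error.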
Filtering based on whether a column has NULL values or not is crucial for data cleaning:
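For example, assuming df has Address and Email columns as described in the recap below:

from pyspark.sql.functions import col

# Rows where Address is missing
missing_address = df.filter(col("Address").isNull())

# Rows with a valid (non-null) Email
with_email = df.filter(col("Email").isNotNull())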
This set of operations will help you efficiently manage and transform your data in PySpark, ensuring data integrity and accuracy for your analysis! To recap:
1. Changing Data Types: Easily modify column types using .cast(). E.g., change 'Salary' to
double or 'Phone' to string for better data handling.
2. Filtering Data: Use .filter() or .where() to extract specific rows. For example, filter
employees with a salary over 50,000 or non-null Age.
3. Multiple Conditions: Chain filters with & and | to apply complex conditions, such as
finding employees over 30 in the IT department.
4. Handling NULLs: Use .isNull() and .isNotNull() to filter rows with missing or available
values, such as missing addresses or valid emails.
5. Unique/Distinct Values: Use .distinct() to get unique rows or distinct values in a column. Remove duplicates based on specific fields like Email or Phone using .dropDuplicates() (see the sketch after this list).
6. Count Distinct Values: Count distinct values in one or multiple columns to analyze data diversity, such as counting unique departments or combinations of Department and Performance_Rating (a sketch follows below).
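Here is a minimal sketch of the unique/distinct operations from point 5, again assuming the df used above with Department, Email, and Phone columns:

# All unique rows in the DataFrame
unique_rows = df.distinct()

# Distinct values in a single column
departments = df.select("Department").distinct()

# Remove duplicates based on specific fields
deduped = df.dropDuplicates(["Email", "Phone"])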
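And a sketch for point 6, counting distinct values with countDistinct (assuming a Performance_Rating column):

from pyspark.sql.functions import countDistinct

# Number of unique departments
df.select(countDistinct("Department").alias("unique_departments")).show()

# Number of unique Department / Performance_Rating combinations
df.select(countDistinct("Department", "Performance_Rating").alias("unique_combinations")).show()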