PySpark Distinct and Filter
Here's a structured set of notes, with code, covering how to change data types, filter data, and handle unique/distinct values in PySpark using the employee data:
1. Changing Data Types
In PySpark, you can change the data type of a column using the cast() method. This is helpful when you need to convert data types for columns like Salary or Phone.
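For example (a minimal sketch, assuming the employee data is already loaded into a DataFrame named df with Salary and Phone columns):

from pyspark.sql.functions import col

# Cast Salary to a double and Phone to a string
df = df.withColumn("Salary", col("Salary").cast("double")) \
       .withColumn("Phone", col("Phone").cast("string"))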
# Verify the updated schema
df.printSchema()
2. Filtering Data
You can filter rows based on specific conditions. For instance, to filter employees with a
salary greater than 50,000:
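A minimal sketch, assuming the same df with a numeric Salary column:

from pyspark.sql.functions import col

# Keep only rows where Salary exceeds 50,000
high_earners = df.filter(col("Salary") > 50000)
high_earners.show()

.where() is an alias for .filter(), so df.where(col("Salary") > 50000) produces the same result.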
You can also apply multiple conditions using & or | (AND/OR) to filter data. For example,
finding employees over 30 years old and in the IT department:
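For instance (assuming Age and Department columns on df):

from pyspark.sql.functions import col

# Employees older than 30 AND in the IT department
it_over_30 = df.filter((col("Age") > 30) & (col("Department") == "IT"))
it_over_30.show()

Note the parentheses around each condition: & and | bind more tightly than the comparison operators in Python, so leaving them out raises an error.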
Filtering based on whether a column has NULL values or not is crucial for data cleaning:
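For example, assuming df has Address and Email columns as described in the recap below:

from pyspark.sql.functions import col

# Rows where Address is missing
missing_address = df.filter(col("Address").isNull())

# Rows with a valid (non-null) Email
with_email = df.filter(col("Email").isNotNull())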
This set of operations will help you efficiently manage and transform your data in PySpark, ensuring data integrity and accuracy for your analysis! To recap:
1. Changing Data Types: Easily modify column types using .cast(). E.g., change 'Salary' to
double or 'Phone' to string for better data handling.
2. Filtering Data: Use .filter() or .where() to extract specific rows. For example, filter
employees with a salary over 50,000 or non-null Age.
3. Multiple Conditions: Chain filters with & and | to apply complex conditions, such as
finding employees over 30 in the IT department.
4. Handling NULLs: Use .isNull() and .isNotNull() to filter rows with missing or available
values, such as missing addresses or valid emails.
5. Unique/Distinct Values: Use .distinct() to get unique rows or distinct values in a column. Remove duplicates based on specific fields like Email or Phone using .dropDuplicates() (see the sketch after this list).
6. Count Distinct Values: Count distinct values in one or multiple columns to analyze data diversity, such as counting unique departments or combinations of Department and Performance_Rating (a sketch follows below).
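Here is a minimal sketch of the unique/distinct operations from point 5, again assuming the df used above with Department, Email, and Phone columns:

# All unique rows in the DataFrame
unique_rows = df.distinct()

# Distinct values in a single column
departments = df.select("Department").distinct()

# Remove duplicates based on specific fields
deduped = df.dropDuplicates(["Email", "Phone"])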
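And a sketch for point 6, counting distinct values with countDistinct (assuming a Performance_Rating column):

from pyspark.sql.functions import countDistinct

# Number of unique departments
df.select(countDistinct("Department").alias("unique_departments")).show()

# Number of unique Department / Performance_Rating combinations
df.select(countDistinct("Department", "Performance_Rating").alias("unique_combinations")).show()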