BDDA Notes

These notes cover techniques for filtering, grouping, and joining Spark DataFrames: filtering with regular expressions and logical operators, grouping data and aggregating values, joining DataFrames on columns, and counting null values per column. Methods such as select, filter, distinct, groupby, agg, and join are used to manipulate and combine DataFrames.
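The snippets use a SparkSession and two DataFrames, airports_df and weather_df, whose creation the notes do not show. A minimal setup sketch, assuming the data sits in hypothetical CSV files airports.csv and weather.csv (file names are assumptions, not from the notes):

# 0.0 Setup (sketch; the original notes omit this step)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bdda-notes").getOrCreate()
airports_df = spark.read.csv("airports.csv", header=True, inferSchema=True)
weather_df  = spark.read.csv("weather.csv",  header=True, inferSchema=True)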


# 9.0 Filtering with regular expressions

# 9.1 Filter with regular expressions:
airports_df.select('name'). \
            where(" name rlike 'pal$' "). \
            show(3, truncate=False)

# 9.2 where is an alias for filter
airports_df.select('name'). \
            filter(" name rlike 'pal$' "). \
            show(3, truncate=False)

# 9.3
airports_df.filter(" name rlike 'pal$' "). \
            show(3, truncate=False)

## Use of Column API
######################

# 8.7 Use of the isin() function.
# It is difficult to use isin() within
# a string condition because of the list object.
# Syntax: Column.isin(*cols)
airports_df.select("name"). \
            where(airports_df.name.isin(['Lansdowne Airport',
                                         'Randall Airport']))

# 8.8 Use of %like%
# Syntax: Column.like(other)
# other: SQL LIKE expression
airports_df.select(airports_df.columns[:2]). \
            where("name like '%La%'"). \
            show(3)

# 8.9 Note the like() function
airports_df.select(airports_df.columns[:2]). \
            where(airports_df.name.like('%La%')). \
            show(3)

# 10. Combining verbs: select, filter and distinct
airports_df.select('dst', 'tz'). \
            filter(airports_df.tz == -5). \
            show(3)

# 10.1
airports_df.select('dst', 'tz'). \
            filter(airports_df.tz == -5). \
            distinct(). \
            show(3)

# 11. Filtering with logical operators: &, |, ==

# 11.1 We can filter our data based on multiple conditions (AND or OR)
#      Logical operators: & == and,  | == or,  ~ == not
airports_df.filter(                           \
            (airports_df.tz == -5) &          \
            (airports_df.dst == "A") |        \
            (airports_df.name.like('%Lans%')) \
            ).show(3)

# 11.2 Conditions within strings:
airports_df.filter(                \
            "(tz == -5) AND        \
             (dst == 'A') OR       \
             (name like '%Lans%')" \
            ).show(3)
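# 11.1 lists ~ for negation but none of the examples use it; a small sketch, using the same airports_df:

# 11.3 Negation with ~ (sketch; not in the original notes)
airports_df.filter( \
            ~(airports_df.name.like('%Lans%')) \
            ).count()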

# 12. groupby. Can apply sum, min, max, count
airports_df.groupby('tz'). \
            count(). \
            show(3)

# 12.1
airports_df.groupby('tz'). \
            agg({'lat' : 'mean'}). \
            show(3)

# 12.2
airports_df.groupby(['tz','dst']). \
            agg({'lat' : 'mean'}). \
            show(3)
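# 12 says sum, min, max and count can all be applied; a sketch combining several aggregations in one agg() call ('lon' is an assumed column, the others appear in the notes):

# 12.3 Several aggregations at once (sketch; 'lon' assumed to exist)
airports_df.groupby('tz'). \
            agg({'lat' : 'mean', 'lon' : 'max', 'dst' : 'count'}). \
            show(3)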

# 10. Joins

# 10.1
airports_df.join(                                       # Left dataset
                  weather_df,                           # Right dataset
                  airports_df.faa==weather_df.origin    # Join on
                ).show(3)

# 10.2
airports_df.join(
                  weather_df,
                  airports_df.faa==weather_df.origin,
                  how = 'inner'
                ).show(3)

# 10.3
airports_df.join(
                  weather_df,
                  airports_df.faa==weather_df.origin,
                  how = 'left'
                ).count()   # Could also use 'left_outer', 'right', 'full'

# 10.4
weather_df.join(
                  airports_df,
                  airports_df.faa==weather_df.origin,
                  how = 'left'
                ).count()   # Could also use 'left_outer', 'right', 'full'
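A quick way to see the difference between the join types mentioned above is to compare row counts; a sketch reusing the same join condition:

# 10.5 Comparing join types (sketch; extends # 10.2-# 10.4)
cond = airports_df.faa == weather_df.origin
airports_df.join(weather_df, cond, how='inner').count()   # Matching rows only
airports_df.join(weather_df, cond, how='left').count()    # Also keeps unmatched airports
airports_df.join(weather_df, cond, how='full').count()    # Keeps unmatched rows from both sides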
# The examples below use a different DataFrame, df, with columns
# such as age, workclass and income.

# 6.0 Transforming filtered data to a numpy array
#     Maybe for plotting:
import numpy as np
np.array(df.filter(df.age == 21).take(2))
# OR full data
np.array(df.filter(df.age == 21).collect())

# 6.1 Return a pandas dataframe
abc = df.toPandas()
abc.head(2)

# 7.0 Per column, how many null values:
from pyspark.sql.functions import isnan, when, count, col

def null_values(data):
    data.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
                 for c in data.columns]).show()

# 7.1 Use the function:
null_values(df)

# 7.2 Use a where filter
df.select('*').where(df.income.isNull()).count()

# 9.0 Getting the mode of a feature--step by step
df.groupby('workclass').count().show(3)
df.groupby('workclass').count().orderBy("count").show(3)
df.groupby('workclass').count().orderBy("count", ascending=False).show(3)
df.groupby('workclass').count().orderBy("count", ascending=False).first()   # Print the first Row

# 9.0.1 Row object:
df.groupby('workclass').count().orderBy("count", ascending=False).first()['workclass']   # Access values as dict values
df.groupby('workclass').count().orderBy("count", ascending=False).first().workclass      # Access values like attributes
df.groupby('workclass').count().orderBy("count", ascending=False).first()[0]             # A Row object behaves like a dict
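# 9.0 builds up the mode step by step; the final expression can be wrapped in a small helper (a sketch, not part of the original notes):

# 9.0.2 Mode as a reusable function (sketch)
def mode_of(data, colname):
    # The most frequent value is the first row after sorting counts descending
    return data.groupby(colname).count(). \
                orderBy("count", ascending=False). \
                first()[colname]

mode_of(df, 'workclass')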
