BDDA Notes

These notes cover techniques for filtering, grouping, and joining Spark DataFrames: filtering with regular expressions and logical operators, grouping data and aggregating values, joining DataFrames on columns, and counting null values per column. Methods such as select, filter, distinct, groupby, agg, and join are used to manipulate and combine DataFrames.
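The snippets use a SparkSession and two DataFrames, airports_df and weather_df, whose creation the notes do not show. A minimal setup sketch, assuming the data sits in hypothetical CSV files airports.csv and weather.csv (file names are assumptions, not from the notes):

# 0.0 Setup (sketch; the original notes omit this step)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bdda-notes").getOrCreate()
airports_df = spark.read.csv("airports.csv", header=True, inferSchema=True)
weather_df  = spark.read.csv("weather.csv",  header=True, inferSchema=True)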


# 9.0 Filtering with regular expressions

# 9.1 Filter with regular expressions:
airports_df.select('name'). \
            where(" name rlike 'pal$' "). \
            show(3, truncate=False)

# 9.2 where is an alias for filter
airports_df.select('name'). \
            filter(" name rlike 'pal$' "). \
            show(3, truncate=False)

# 9.3
airports_df.filter(" name rlike 'pal$' "). \
            show(3, truncate=False)

## Use of Column API
######################

# 8.7 Use of the isin() function.
# It is difficult to use isin() within
# a string condition because of the list object.
# Syntax: Column.isin(*cols)
airports_df.select("name"). \
            where(airports_df.name.isin(['Lansdowne Airport',
                                         'Randall Airport']))

# 8.8 Use of %like%
# Syntax: Column.like(other)
# other: SQL LIKE expression
airports_df.select(airports_df.columns[:2]). \
            where("name like '%La%'"). \
            show(3)

# 8.9 Note the like() function
airports_df.select(airports_df.columns[:2]). \
            where(airports_df.name.like('%La%')). \
            show(3)

# 10. Combining verbs: select, filter and distinct
airports_df.select('dst', 'tz'). \
            filter(airports_df.tz == -5). \
            show(3)

# 10.1
airports_df.select('dst', 'tz'). \
            filter(airports_df.tz == -5). \
            distinct(). \
            show(3)

# 11. Filtering with logical operators: &, |, ==

# 11.1 We can filter our data based on multiple conditions (AND or OR)
#      Logical operators: & == and,  | == or,  ~ == not
airports_df.filter(                           \
            (airports_df.tz == -5) &          \
            (airports_df.dst == "A") |        \
            (airports_df.name.like('%Lans%')) \
            ).show(3)

# 11.2 Conditions within strings:
airports_df.filter(                \
            "(tz == -5) AND        \
             (dst == 'A') OR       \
             (name like '%Lans%')" \
            ).show(3)
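# 11.1 lists ~ for negation but none of the examples use it; a small sketch, using the same airports_df:

# 11.3 Negation with ~ (sketch; not in the original notes)
airports_df.filter( \
            ~(airports_df.name.like('%Lans%')) \
            ).count()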

# 12. groupby. Can apply sum, min, max, count
airports_df.groupby('tz'). \
            count(). \
            show(3)

# 12.1
airports_df.groupby('tz'). \
            agg({'lat' : 'mean'}). \
            show(3)

# 12.2
airports_df.groupby(['tz','dst']). \
            agg({'lat' : 'mean'}). \
            show(3)
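# 12 says sum, min, max and count can all be applied; a sketch combining several aggregations in one agg() call ('lon' is an assumed column, the others appear in the notes):

# 12.3 Several aggregations at once (sketch; 'lon' assumed to exist)
airports_df.groupby('tz'). \
            agg({'lat' : 'mean', 'lon' : 'max', 'dst' : 'count'}). \
            show(3)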

# 10. Joins

# 10.1
airports_df.join(                                       # Left dataset
                  weather_df,                           # Right dataset
                  airports_df.faa==weather_df.origin    # Join on
                ).show(3)

# 10.2
airports_df.join(
                  weather_df,
                  airports_df.faa==weather_df.origin,
                  how = 'inner'
                ).show(3)

# 10.3
airports_df.join(
                  weather_df,
                  airports_df.faa==weather_df.origin,
                  how = 'left'
                ).count()   # Could also use 'left_outer', 'right', 'full'

# 10.4
weather_df.join(
                  airports_df,
                  airports_df.faa==weather_df.origin,
                  how = 'left'
                ).count()   # Could also use 'left_outer', 'right', 'full'
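A quick way to see the difference between the join types mentioned above is to compare row counts; a sketch reusing the same join condition:

# 10.5 Comparing join types (sketch; extends # 10.2-# 10.4)
cond = airports_df.faa == weather_df.origin
airports_df.join(weather_df, cond, how='inner').count()   # Matching rows only
airports_df.join(weather_df, cond, how='left').count()    # Also keeps unmatched airports
airports_df.join(weather_df, cond, how='full').count()    # Keeps unmatched rows from both sides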
# The examples below use a different DataFrame, df, with columns
# such as age, workclass and income.

# 6.0 Transforming filtered data to a numpy array
#     Maybe for plotting:
import numpy as np
np.array(df.filter(df.age == 21).take(2))
# OR full data
np.array(df.filter(df.age == 21).collect())

# 6.1 Return a pandas dataframe
abc = df.toPandas()
abc.head(2)

# 7.0 Per column, how many null values:
from pyspark.sql.functions import isnan, when, count, col

def null_values(data):
    data.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
                 for c in data.columns]).show()

# 7.1 Use the function:
null_values(df)

# 7.2 Use a where filter
df.select('*').where(df.income.isNull()).count()

# 9.0 Getting the mode of a feature--step by step
df.groupby('workclass').count().show(3)
df.groupby('workclass').count().orderBy("count").show(3)
df.groupby('workclass').count().orderBy("count", ascending=False).show(3)
df.groupby('workclass').count().orderBy("count", ascending=False).first()   # Print the first Row

# 9.0.1 Row object:
df.groupby('workclass').count().orderBy("count", ascending=False).first()['workclass']   # Access values as dict values
df.groupby('workclass').count().orderBy("count", ascending=False).first().workclass      # Access values like attributes
df.groupby('workclass').count().orderBy("count", ascending=False).first()[0]             # A Row object behaves like a dict
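# 9.0 builds up the mode step by step; the final expression can be wrapped in a small helper (a sketch, not part of the original notes):

# 9.0.2 Mode as a reusable function (sketch)
def mode_of(data, colname):
    # The most frequent value is the first row after sorting counts descending
    return data.groupby(colname).count(). \
                orderBy("count", ascending=False). \
                first()[colname]

mode_of(df, 'workclass')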
