PySpark & Spark SQL: Scenario-Based Interview
PySpark vs Spark SQL
Ganesh. R
Scenario: Write a query that produces one output row for every opening of each job position. A position's totalpost column gives its number of openings; when a position has fewer assigned employees than openings, pad the output with extra rows whose employee name is 'Vacant', one per unfilled opening.
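To make the snippets below runnable, assume two small input tables shaped like this. The column names are taken from the query; the sample values (and the spark session, as in a Databricks notebook) are illustrative assumptions, not from the original post.

# Illustrative inputs -- schemas inferred from the query below
job_positions_columns = ["id", "title", "groups", "levels", "payscale", "totalpost"]
job_positions_data = [
    (1, "Data Engineer", "IT", "L2", 60000, 3),  # 3 openings
    (2, "Data Analyst", "IT", "L1", 45000, 2),   # 2 openings
]
job_positions_df = spark.createDataFrame(job_positions_data,
                                         schema=job_positions_columns)

job_employees_columns = ["id", "name", "position_id"]
job_employees_data = [
    (1, "Alice", 1),  # fills opening 1 of position 1
    (2, "Bob", 1),    # fills opening 2 of position 1; opening 3 stays vacant
    (3, "Carol", 2),  # fills opening 1 of position 2; opening 2 stays vacant
]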
### PySpark
from pyspark.sql import Row

# Create the DataFrame for job_employees (using the sample data above)
job_employees_df = spark.createDataFrame(job_employees_data,
                                         schema=job_employees_columns)

# Expand each job position into totalpost rows -- one row per opening --
# numbering the openings 0..totalpost-1 in pos_num
expanded_positions = job_positions_df.rdd.flatMap(
    lambda row: [
        Row(id=row['id'], title=row['title'], groups=row['groups'],
            levels=row['levels'], payscale=row['payscale'],
            totalpost=row['totalpost'], pos_num=i)
        for i in range(row['totalpost'])
    ]
).toDF()
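The snippet above only builds the expanded slots. To finish the PySpark version, number each position's employees with a window function and left-join them onto the slots, defaulting unmatched slots to 'Vacant'. A minimal sketch mirroring the SQL below (the variable names here are my own):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number each position's employees 0..n-1 so they line up with pos_num
w = Window.partitionBy("position_id").orderBy("id")
numbered_emps = job_employees_df.withColumn("emp_num",
                                            F.row_number().over(w) - 1)

result = (
    expanded_positions
    .join(
        numbered_emps,
        (expanded_positions.id == numbered_emps.position_id)
        & (expanded_positions.pos_num == numbered_emps.emp_num),
        "left",
    )
    .select(
        expanded_positions.title,
        expanded_positions.groups,
        expanded_positions.payscale,
        F.coalesce(numbered_emps["name"], F.lit("Vacant")).alias("name"),
    )
    .orderBy(expanded_positions.id, "pos_num")
)
result.show()

With the sample inputs above, this shows three Data Engineer rows (Alice, Bob, Vacant) and two Data Analyst rows (Carol, Vacant).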
### SQL
job_positions_df.createOrReplaceTempView("job_positions")
job_employees_df.createOrReplaceTempView("job_employees")
%sql
with cte as (
  select
    name,
    position_id,
    -- rn: the employee's slot number within their own position
    row_number() over (
      partition by position_id
      order by id
    ) as rn,
    -- grn: a global sequence 1..N, reused below to generate slot numbers
    row_number() over (
      order by id
    ) as grn
  from
    job_employees
),
jp as (
  select
    a.id,
    a.title,
    a.groups,
    a.payscale,
    a.levels,
    b.grn as rn
  from
    job_positions as a
    join cte as b on b.grn <= a.totalpost
)
select
  a.title,
  a.groups,
  a.payscale,
  coalesce(b.name, 'Vacant') as name
from
  jp as a
  left join cte as b on b.rn = a.rn
  and b.position_id = a.id
order by
  a.id,
  a.rn;
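A subtlety in the query above: a single row_number() cannot do double duty. Matching an employee to a slot needs a per-position number (rn), while fanning each position out into totalpost rows needs a global sequence (grn). Even then, the trick quietly assumes the company has at least as many employees in total as the largest totalpost. Spark SQL (2.4+) can generate the slot numbers directly with sequence and explode, which removes that assumption; a sketch of that variant:

with slots as (
  select
    p.id,
    p.title,
    p.groups,
    p.payscale,
    slot
  from
    job_positions as p
    lateral view explode(sequence(1, p.totalpost)) t as slot
),
emp as (
  select
    name,
    position_id,
    row_number() over (
      partition by position_id
      order by id
    ) as rn
  from
    job_employees
)
select
  s.title,
  s.groups,
  s.payscale,
  coalesce(e.name, 'Vacant') as name
from
  slots as s
  left join emp as e on e.rn = s.slot
  and e.position_id = s.id
order by
  s.id,
  s.slot;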
If you found this post useful, please save it.
Ganesh. R
+91-9030485102. Hyderabad, Telangana. [email protected]
https://medium.com/@rganesh0203 https://rganesh203.github.io/Portfolio/
https://github.com/rganesh203 https://www.linkedin.com/in/r-ganesh-a86418155/
https://www.instagram.com/rg_data_talks/ https://topmate.io/ganesh_r0203