SA Coding Assessment: Data Engineering, Airlines
Version 2022.01
What you'll do:
We provide the dataset. You will load it into dataframes, and perform
some data cleansing and transformation tasks.
You will answer a series of questions to show insights from the data.
There are also some written-answer questions.
We care about the process, not the result. I.e., we're looking for proper
use of data engineering techniques and understanding of the code you've
written.
# This folder is for you to write any data as needed. Write access is restricted elsewhere. You can always read from dbfs.
aws_role_id = "AROAUQVMTFU2DCVUR57M2"
user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
userhome = f"s3a://e2-interview-user-data/home/{aws_role_id}:{user}"
print(userhome)
Question 1
Write code that uses the DataFrame API to read in the entire airlines data
set with clearly named columns.
header = spark.read.csv("/databricks-datasets/airlines/part-00000", header=True, inferSchema=True)  # path assumed; part-00000 holds the header row
airline_sch = header.schema
airlineDf = spark.read.csv("/databricks-datasets/airlines/part-*", schema=airline_sch, header=True)  # load the full dataset with named columns; header=True drops the header row in part-00000
airlineDf.cache()
display(airlineDf)
Written Answer
I have used the Spark DataFrame API to read the airlines CSV files. Since only part-00000 contains the header record, and the header is not always the first record when the files are loaded into a dataframe in parallel, I first read the schema from part-00000 and then used that schema to load the full dataset with named columns, with the header option set to True to eliminate the header record from part-00000.
I have cached the resulting dataframe (airlineDf) for frequent use in the questions below, as a best practice and for better performance.
unqCar = airlineDf.select(F.countDistinct("UniqueCarrier"))
display(unqCar)
Written Answer
I am using Dataframe method Select and using countDistinct function
to find no. of unique UniqueCarrier from the filtered dataset
Storing the output into new dataframe unqCar.
display(delayedAirline)
Written Answer
Filtered the airlines that have departure delay records by checking that the DepDelay field is greater than 0.
Aggregated the number of delay occurrences for each airline, grouped by the UniqueCarrier field.
Sorted NumDelays in descending order and took the first 5 rows into the result dataframe, delayedAirline (a sketch of this cell follows below).
Displayed a bar chart of the 5 carriers and their respective delay occurrences.
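The cell that builds delayedAirline is not shown above; the following is a minimal sketch of the steps just described, assuming the cached airlineDf from Question 1 and that DepDelay is read as a string column.

import pyspark.sql.functions as F

# Sketch only: reconstruct the delayedAirline dataframe described above.
delayedAirline = (airlineDf
    .filter(F.col("DepDelay").cast("int") > 0)            # keep only delayed departures
    .groupBy("UniqueCarrier")
    .agg(F.count("*").alias("NumDelays"))                 # delay occurrences per carrier
    .orderBy(F.desc("NumDelays"))
    .limit(5))                                            # top 5 carriers by delay count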
Part b: Also, for each airline, on average did flights arrive early or late?
Calculate the average arrival delay per airline. Then, add a column to this
new dataframe (containing the grouped averages) that contains the string
"Late" if the average arrival for that airline arrive >15 minutes late,
"Early" if the average arrival is <0 minutes late, and "On-time" if the
average arrival is between 0 and 15 minutes late.
To add the additional column, use a Spark UDF. Additionally, make sure to
filter out or fill in null values in your dataframe (if there are any) prior to
applying the UDF.
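The Part a cell below starts from a dataframe named arrDelaydf that is not defined in the excerpt; here is a minimal sketch of how it could be prepared from the cached airlineDf, assuming ArrDelay is read as a string column that uses "NA" for missing values.

import pyspark.sql.functions as F

# Sketch only: prepare arrDelaydf, dropping null/NA arrival delays before the UDF is applied.
arrDelaydf = (airlineDf
    .filter(F.col("ArrDelay").isNotNull() & (F.col("ArrDelay") != "NA"))
    .withColumn("iArrDelay", F.col("ArrDelay").cast("int")))   # integer copy used for the averages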
# Part a:
avgArrival = arrDelaydf.groupby("UniqueCarrier").agg(F.avg("iArrDelay").alias("AvgArrDelay"))
display(avgArrival)
# Part b
from pyspark.sql.functions import udf

@udf("string")
def groupedAvg_udf(a):
    if a >= 15:
        return 'Late'
    elif a >= 0:
        return 'On-time'
    else:
        return 'Early'
display(groupedArr)
Written Answer
Part A: Grouped the arrival-delay dataframe by UniqueCarrier and took the average of iArrDelay as AvgArrDelay.
Part B: Declared a UDF that takes the average arrival delay as input and applies the logic to return Late, On-time or Early.
Created a new dataframe, adding a new column with the UDF that returns the grouped average delay string (a sketch of this cell follows).
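The cell that creates groupedArr is not shown; a minimal sketch, assuming the avgArrival dataframe and groupedAvg_udf from the cells above, with pyspark.sql.functions imported as F as elsewhere in the notebook. The column name ArrivalStatus is illustrative only.

# Sketch only: apply the UDF to the per-carrier averages, dropping any null averages first.
groupedArr = avgArrival.na.drop(subset=["AvgArrDelay"]) \
                       .withColumn("ArrivalStatus", groupedAvg_udf(F.col("AvgArrDelay")))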
What format would you store this data in if you frequently queried only
the UniqueCarr and CancellationCode columns?
Note: Cite any sources used. You do not need a code answer for this
question.
Written Answer
The source file format was CSV; looking at the data and the operations performed, I think it would be good to store the data in Parquet format.
If I am supposed to use only a few columns for the analysis, then I prefer Parquet, since it performs well for column-level operations, especially when the data volume is huge.
If I am supposed to process entire rows, then I prefer the ORC file format, since it gives better read performance at the row level: data are stored and processed at stripe level, with the option to skip rows based on the statistics stored in the footer of each stripe.
Written Answer
Most of the questions were based on a single column or a few columns, so I would use Delta or Parquet format to store and process the data instead of the CSV format.
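Although the question does not require a code answer, a minimal sketch of the idea, assuming the cached airlineDf, an illustrative output path under userhome, and the full column name UniqueCarrier:

# Sketch only: write the data once as Parquet (columnar), then read back just the two columns.
airlineDf.write.mode("overwrite").parquet(userhome + "/airlines_parquet")
carriers = spark.read.parquet(userhome + "/airlines_parquet").select("UniqueCarrier", "CancellationCode")
display(carriers)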
SA coding assessment: Data Engineering, Baby Names
Version 2022.01
What you'll do:
We provide the dataset. You will load it into dataframes, and perform
some data cleansing and transformation tasks.
You will answer a series of questions to show insights from the data.
There are also some written-answer questions.
We care about the process, not the result. I.e., we're looking for proper
use of data engineering techniques and understanding of the code you've
written.
Setup Env
# This folder is for you to write any data as needed. Write access is restricted elsewhere. You can always read from dbfs.
aws_role_id = "AROAUQVMTFU2DCVUR57M2"
user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
userhome = f"s3a://e2-interview-user-data/home/{aws_role_id}:{user}"
print(userhome)
s3a://e2-interview-user-data/home/AROAUQVMTFU2DCVUR57M2:[email protected]
import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

val tmpFile = new File("/tmp/rows.json")
FileUtils.copyURLToFile(new URL("https://health.data.ny.gov/api/views/jxy9-yhdk/rows.json?accessType=DOWNLOAD"), tmpFile)
# https://docs.python.org/3/library/hashlib.html#blake2
from hashlib import blake2b

user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
h = blake2b(digest_size=4)
h.update(user.encode("utf-8"))
display_name = user.split("@")[0].replace(".", "_") + "_" + h.hexdigest()  # assumed: the cell defining display_name was not captured
dbutils.fs.cp(userhome + '/rows.json', f"dbfs:/tmp/{display_name}/rows.json")
baby_names_path = f"dbfs:/tmp/{display_name}/rows.json"
Suggested Steps:
# dbutils.fs.head(baby_names_path)
from pyspark.sql.functions import explode

spark.read.json(baby_names_path, multiLine=True).select(explode("data").alias("BN")).createOrReplaceTempView("NameView")

sqlContext.sql("SELECT BN[0] AS sid, BN[1] AS id, BN[2] AS position, BN[3] AS created_at, BN[4] AS created_meta, BN[5] AS updated_at, BN[6] AS updated_meta, BN[7] AS meta, BN[8] AS year, BN[9] AS first_name, BN[10] AS county, BN[11] AS sex, BN[12] AS count FROM NameView").createOrReplaceTempView("babyNames")
Written Answer
1. I have used the native Spark JSON read method, with the explode function to extract the "data" node, and created a temp view, NameView.
2. Using NameView, I extracted the nested columns into a named temp view, babyNames, using createOrReplaceTempView().
3. Finally, I display a sample of 5 records for review (see the sketch below).
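The cell that shows the sample records is not included above; a one-line sketch, assuming the babyNames temp view created in step 2:

# Sketch only: preview 5 records from the babyNames temp view
spark.sql("SELECT * FROM babyNames LIMIT 5").show()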
Code Answer
# Please provide your code answer for Question 2 here. You will need separate cells for your SQL answer and your Python or Scala answer.
# SQL Answer
# (assumed) The cell creating sumBabyNames was not captured; it aggregates counts per year and name:
sumBabyNames = spark.sql("SELECT year, first_name, SUM(count) AS sum_name FROM babyNames GROUP BY year, first_name")
display(sumBabyNames)
sumBabyNames.createOrReplaceTempView("bNames")
popNames = spark.sql("SELECT year, first_name, sum_name FROM (SELECT *, DENSE_RANK() OVER (PARTITION BY year ORDER BY sum_name DESC) AS rank FROM bNames) WHERE rank = 1 ORDER BY year")
display(popNames)
# Python code
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank, desc, col

nameDf = spark.sql("select year, first_name, sum(count) as scount from babyNames group by year, first_name")
popularName = nameDf.withColumn("rank", dense_rank().over(Window.partitionBy("year").orderBy(desc("scount")))) \
                    .where(col("rank") == 1).select('year', 'first_name', 'scount').orderBy("year")
popularName.show()
Written Answer
visitors_path = "/interview-datasets/sa/births/births-with-visitor-data.json"
## Hint: the code below will read in the downloaded JSON files. However, the xml column needs to be given structure. Consider using a UDF.
# Read the births-with-visitor-data.json file into a dataframe, parse the nested XML fields into columns, and print the total record count
df = spark.read.option("inferSchema", True).json(visitors_path)
import xml.etree.ElementTree as ET
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Parse the XML 'visitors' column into an array of structs so it can be exploded.
# The visitor fields (id, age, sex) are assumptions based on the columns queried below.
visitor_schema = ArrayType(StructType([StructField("id", StringType()), StructField("age", StringType()), StructField("sex", StringType())]))

@F.udf(returnType=visitor_schema)
def parse_visitors(s):
    root = ET.fromstring(s)
    return [(v.get("id"), v.get("age"), v.get("sex")) for v in root.findall("visitor")]

bVisitors = df.select("year", "county", "first_name", F.explode(parse_visitors("visitors")).alias("visitors")).select("year", "county", "first_name", "visitors.*")
bVisitors.createOrReplaceTempView("babyVisitors")
bVisitors.show(1)
bVisitors.printSchema()
## Hint: check for inconsistently capitalized field values. It will make your answer incorrect.
# Find the county with the highest average number of visitors across all births in that county
highAvgVisitors = bVisitors.groupBy(F.upper("county").alias("county")).agg(F.avg("id").alias("avgVis")) \
    .select(F.max(F.struct("avgVis", "county")).alias("v")).select("v.*")  # struct ordered by avgVis so max() picks the highest average
display(highAvgVisitors)
## Hint: check for inconsistently capitalized field values. It will make your answer incorrect.
# Find the average visitor age for a birth in the county of KINGS
bVisitors.filter(F.upper(F.col("county")) == 'KINGS').groupBy(F.upper("county").alias("county")).agg(F.avg("age").alias("avgAge")).show()
## Hint: check for inconsistently capitalized field values. It will make your answer incorrect.
# Find the most common birth visitor age in the county of KINGS
# Fixing the inconsistently capitalized county field by converting it to UPPER case as the standard
display(mostComAge)
#4 - Written Answer
In this baby visitor dataset, I apply the logic below to find the most common age in the county "KINGS" using Spark SQL (see the sketch after this list):
a. Find the count of each age grouped by county, applying the UPPER function to eliminate case-sensitive splitting of records
b. Order the result set in descending order and pick the first element from the output
c. Display the final result from the dataframe
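The cell that builds mostComAge is not shown above; a minimal sketch of the Spark SQL described, assuming the babyVisitors temp view and its age column:

# Sketch only: most common visitor age in KINGS, normalizing county case before filtering.
mostComAge = spark.sql("""
    SELECT age, COUNT(*) AS ageCount
    FROM babyVisitors
    WHERE UPPER(county) = 'KINGS'
    GROUP BY age
    ORDER BY ageCount DESC
    LIMIT 1
""")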