
SA coding assessment: Data Engineering, Airlines
Version 2022.01
What you'll do:

 We provide the dataset. You will load it into dataframes, and perform
some data cleansing and transformation tasks.
 You will answer a series of questions to show insights from the data.
 There are also some written-answer questions.

We care about the process, not the result. I.e., we're looking for proper
use of data engineering techniques and understanding of the code you've
written.

This Data Engineering section is scored out of 55 points.

# This folder is for you to write any data as needed. Write access is restricted elsewhere. You can always read from dbfs.

aws_role_id = "AROAUQVMTFU2DCVUR57M2"

user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')

userhome = f"s3a://e2-interview-user-data/home/{aws_role_id}:{user}"

print(userhome)

The following questions use the airlines dataset located at dbfs:/interview-datasets/sa/airlines. All airlines questions expect answers that use the DataFrame API (Scala or Python). SQL-only answers are accepted but may receive reduced points. We will not accept answers that use the RDD API.

Airlines Question 1

Write code that uses the DataFrame API to read in the entire airlines data
set with clearly named columns.

# Please provide your code answer for Question 1 here.

# Reading schema - when loading data from /airlines, the header was not detected properly by
# default with the inferSchema & header switches, so this workaround fetches the schema
# from the first part file.
header = spark.read.csv("/interview-datasets/sa/airlines/part-00000", header=True)
airline_sch = header.schema

# Apply that schema to the full dataset; header=True drops the header record in part-00000.
airlineDf = spark.read.csv("/interview-datasets/sa/airlines/", inferSchema=False, header=True,
                           schema=airline_sch)

# Cache the dataframe since it is reused by the following questions.
airlineDf.cache()

display(airlineDf)

Written Answer

Explain your answer to Question 1 here.

I used the Spark DataFrame API to read the airlines CSV files.

Since only part-00000 contains the header record, and when records are loaded in parallel the header is not guaranteed to be the first record, I first read the schema from part-00000 and then used that schema to load the full dataset with named columns, with header=True so the header record in part-00000 is dropped.

I cached the resulting dataframe (airlineDf) because it is reused in the questions below, which avoids re-reading the CSV files and improves performance.

Airlines Question 2 [5 Points]


How many unique airlines are present in this dataset?
# Please provide your code answer for Question 2 here

from pyspark.sql import functions as F

# Get distinct count of UniqueCarrier

unqCar = airlineDf.select(F.countDistinct("UniqueCarrier"))

display(unqCar)

Written Answer
 I use the DataFrame select method with the countDistinct function to find the number of unique UniqueCarrier values in the cached dataset.
 The result is stored in a new dataframe, unqCar, and displayed.

Airlines Question 3 [10 Points]


Which airline is delayed on departure most often? Show a bar graph of the
top five most delayed airlines.
# Please provide your code answer for Question 3

from pyspark.sql import functions as F
from pyspark.sql import types as T

# DepDelay is read as a string, so cast it to an integer before filtering and aggregating.
airdf = airlineDf.withColumn("iDepDelay", F.col("DepDelay").cast(T.IntegerType()))

# Keep only delayed departures, count delays per carrier, and keep the top five.
# limit(5) keeps the result as a dataframe so display() can render it as a bar chart.
delayedAirline = airdf.filter(F.col("iDepDelay") > 0) \
    .groupBy("UniqueCarrier") \
    .agg(F.count("iDepDelay").alias("NumDelays")) \
    .sort(F.desc("NumDelays")) \
    .limit(5)

display(delayedAirline)

Written Answer

I followed the steps below to derive the result:

 Filter to rows with a departure delay by checking that the DepDelay field (cast to an integer) is greater than 0.
 Aggregate the number of delay occurrences per airline, grouped by the UniqueCarrier field.
 Sort NumDelays in descending order and limit to the top 5 rows in the result dataframe, delayedAirline.
 Display the result as a bar chart of the 5 carriers and their respective delay counts.

Airlines Question 4 [15 Points]


Part a: What was the average arrival delay per airline?

Part b: Also, for each airline, on average did flights arrive early or late?

Calculate the average arrival delay per airline. Then, add a column to this
new dataframe (containing the grouped averages) that contains the string
"Late" if the average arrival for that airline is >15 minutes late,
"Early" if the average arrival is <0 minutes late, and "On-time" if the
average arrival is between 0 and 15 minutes late.
To add the additional column, use a Spark UDF. Additionally, make sure to
filter out or fill in null values in your dataframe (if there are any) prior to
applying the UDF.

# Please provide your code answer for Question 4

# Part a: cast ArrDelay to an integer and average it per carrier.
arrDelaydf = airlineDf.withColumn("iArrDelay", F.col("ArrDelay").cast(T.IntegerType()))
avgArrival = arrDelaydf.groupBy("UniqueCarrier").agg(F.avg("iArrDelay").alias("AvgArrDelay"))

display(avgArrival)

# Part b: classify each carrier's average arrival delay with a UDF.
from pyspark.sql.functions import udf

@udf("string")
def groupedAvg_udf(a):
    if a > 15:
        return 'Late'
    elif a >= 0:
        return 'On-time'
    else:
        return 'Early'

# Drop any null averages before applying the UDF, as the question requires.
groupedArr = avgArrival.na.drop(subset=["AvgArrDelay"]) \
    .withColumn("groupedAvgArrival", groupedAvg_udf("AvgArrDelay"))

display(groupedArr)

Written Answer

Part A:

 Convert the string field ArrDelay to an integer.
 Calculate the average arrival delay per UniqueCarrier.

Part B:

 Declare a UDF that takes the average arrival delay as input and applies the logic to classify it as Late, On-time, or Early.
 Drop rows with a null average, then create a new dataframe with an additional column that uses the UDF to return the grouped average delay string.

Airlines Question 5 [15 Points]


What file format is airlines data stored in, and was this the most optimal
format for the questions asked above?

What format would you store this data in if you frequently queried only
the UniqueCarrier and CancellationCode columns?

What if you frequently read entire rows of the dataset?

Note: Cite any sources used. You do not need a code answer for this
question.

Written Answer

 The data is stored as CSV. Given the data and the operations performed above, it would be better to store it in the Parquet file format (see the sketch below).
 If I only need a few columns for the analysis, I would still prefer Parquet, since it performs well for column-level operations, especially when the data volume is large.
 If I need to process entire rows, I would choose the ORC file format, since it gives good read performance at the row level: data is stored and processed at the stripe level, and row groups can be skipped based on the statistics stored in each stripe's footer.

I have answered based on my experience, but I usually refer to the Spark documentation and Databricks learning materials when researching better ways to design solutions.
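
As a quick illustration of the Parquet point, here is a minimal sketch (the target paths under userhome are hypothetical, and userhome comes from the setup cell above):

# Hypothetical location inside the interview sandbox
parquet_path = userhome + "/airlines_parquet"

# Persist the cached, CSV-backed dataframe in a columnar format
airlineDf.write.mode("overwrite").parquet(parquet_path)

# Column-pruned reads only scan the selected columns from the Parquet files
display(spark.read.parquet(parquet_path).select("UniqueCarrier", "CancellationCode"))

# If entire rows were read frequently, the same dataframe could instead be written as ORC
airlineDf.write.mode("overwrite").orc(userhome + "/airlines_orc")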

Airlines Question 6 [5 Points]


If you needed to keep multiple versions of this dataset, why might you use
the Delta format to do this efficiently?

Written Answer

The Delta Lake format provides time travel, by version number or by timestamp, which removes much of the complexity of managing multiple versions of a dataset. It also provides reliable ACID transactions and better metadata management.
For this use case I would keep a single Delta table and use time travel to restore old data or to merge multiple datasets. Delta also offers good caching behaviour for frequently used queries. Converting from Parquet to Delta is available out of the box on Databricks, which simplifies adoption, and the API is consistent across languages.

Since most of the questions touched only one or a few columns, I would store and process the data in Delta or Parquet format rather than working directly from the CSV files.
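
To make the versioning point concrete, here is a minimal sketch (the Delta table path is hypothetical) of writing the airlines data as a Delta table and reading an earlier version back with time travel:

# Hypothetical location inside the interview sandbox
delta_path = userhome + "/airlines_delta"

# Each overwrite (or merge) creates a new table version; older versions remain queryable
airlineDf.write.format("delta").mode("overwrite").save(delta_path)

# Time travel back to the first version of the table, by version number (a timestamp works too)
v0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
display(v0)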
SA coding assessment: Data Engineering, Baby Names
Version 2022.01
What you'll do:

 We provide the dataset. You will load it into dataframes, and perform
some data cleansing and transformation tasks.
 You will answer a series of questions to show insights from the data.
 There are also some written-answer questions.

We care about the process, not the result. I.e., we're looking for proper
use of data engineering techniques and understanding of the code you've
written.

This Data Engineering section is scored out of 50 points.

Setup Env

# This folder is for you to write any data as needed. Write access is restricted elsewhere. You can always read from dbfs.
aws_role_id = "AROAUQVMTFU2DCVUR57M2"
user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
userhome = f"s3a://e2-interview-user-data/home/{aws_role_id}:{user}"
print(userhome)

s3a://e2-interview-user-data/home/AROAUQVMTFU2DCVUR57M2:[email protected]

Using Baby Names Data Set


This dataset comes from a website referenced by Data.gov. It lists baby
names used in the state of NY from 2007 to 2018.
Run the following two cells to copy this file to a usable location.
%scala
import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

val tmpFile = new File("/tmp/rows.json")
FileUtils.copyURLToFile(new URL("https://health.data.ny.gov/api/views/jxy9-yhdk/rows.json?accessType=DOWNLOAD"), tmpFile)

import java.net.URL import java.io.File import org.apache.commons.io.FileUtils tmpFile: java.io.File = /tmp/rows.json

# https://docs.python.org/3/library/hashlib.html#blake2
from hashlib import blake2b

user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')

# Derive an anonymised display name from a short blake2b hash of the user name
h = blake2b(digest_size=4)
h.update(user.encode("utf-8"))
display_name = "user_" + h.hexdigest()
print("Display Name: " + display_name)

# Copy the downloaded file into the user's home and then into dbfs:/tmp
dbutils.fs.cp('file:/tmp/rows.json', userhome + '/rows.json')
dbutils.fs.cp(userhome + '/rows.json', f"dbfs:/tmp/{display_name}/rows.json")

baby_names_path = f"dbfs:/tmp/{display_name}/rows.json"
print("Baby Names Path: " + baby_names_path)

dbutils.fs.head(baby_names_path)

# Ensure you use baby_names_path to answer the questions. A bug in Spark 2.X will cause your read to fail if you read the file from userhome.

# Please note that dbfs:/tmp is cleaned up daily at 6AM Pacific.

Baby Names Question 1 - Nested Data [15 Points]


Use Spark SQL's native JSON support to read the baby names file into a
dataframe. Use this dataframe to create a temporary table containing all
the nested data columns ("sid", "id", "position", "created_at",
"created_meta", "updated_at", "updated_meta", "meta", "year",
"first_name", "county", "sex", "count") so that they can be queried using
SQL.
Hint: you can use dbutils.fs.head(baby_names_path) to take a look at the
dataset before reading it in.

Suggested Steps:

1. Read in the JSON data


2. Pull all columns in the nested data column to top level, following the schema
specified above. There are built-in Spark SQL functions that will accomplish
this.
3. Create a temp table from this expanded dataframe using
createOrReplaceTempView()

# Please provide your code answer for Question 1 here

# dbutils.fs.head(baby_names_path)

# Load the JSON file into a dataframe and explode the "data" array into one row per record.
from pyspark.sql.functions import explode

spark.read.json(baby_names_path, multiLine=True) \
    .select(explode("data").alias("BN")) \
    .createOrReplaceTempView("NameView")

# Map each array position to its named column, following the schema given in the question.
spark.sql("""
    SELECT BN[0] AS sid, BN[1] AS id, BN[2] AS position, BN[3] AS created_at,
           BN[4] AS created_meta, BN[5] AS updated_at, BN[6] AS updated_meta, BN[7] AS meta,
           BN[8] AS year, BN[9] AS first_name, BN[10] AS county, BN[11] AS sex, BN[12] AS count
    FROM NameView
""").createOrReplaceTempView("babyNames")

display(spark.sql("SELECT * FROM babyNames LIMIT 10"))

Written Answer

1. I used Spark's native JSON reader and the explode function to expand the "data" array, creating a temp view NameView.
2. From NameView, I extracted the nested columns into named columns and registered them as the temp view babyNames using createOrReplaceTempView().
3. Finally, I display a sample of 10 records for review.

Baby Names Question 2 - Multiple Languages [10 Points]


Using the temp table you created in the question above, write a SQL
query that gives the most popular baby name for each year in the
dataset. Then, write the same query using either the Scala or Python
dataframe APIs.

Code Answer

# Please provide your code answer for Question 2 here. You will need separate cells for your SQL answer and your Python or Scala answer.

# SQL answer

# Total occurrences of each name per year.
sumBabyNames = spark.sql("""
    SELECT year, first_name, SUM(count) AS sum_name
    FROM babyNames
    GROUP BY year, first_name
    ORDER BY first_name
""")
display(sumBabyNames)

sumBabyNames.createOrReplaceTempView("bNames")

# Rank names within each year by total count and keep the top-ranked name per year.
popNames = spark.sql("""
    SELECT year, first_name, sum_name
    FROM (SELECT *, DENSE_RANK() OVER (PARTITION BY year ORDER BY sum_name DESC) AS rank FROM bNames)
    WHERE rank = 1
    ORDER BY year
""")
display(popNames)

# Python code

from pyspark.sql.window import Window
from pyspark.sql.functions import col, desc, dense_rank

# Same logic with the DataFrame API: sum counts per (year, name), rank within each year, keep rank 1.
nameDf = spark.sql("select year, first_name, sum(count) as scount from babyNames group by year, first_name")

popularName = nameDf.withColumn("rank", dense_rank().over(Window.partitionBy("year").orderBy(desc("scount")))) \
    .where(col("rank") == 1) \
    .select("year", "first_name", "scount") \
    .orderBy("year")

display(popularName)

Written Answer

Please provide your brief, written description of your code here.

Baby Names Question 3 - Performance [10 Points]


Are there any performance considerations when choosing a language API
(SQL vs Python vs Scala) in the context of Spark?

Are there any performance considerations when using different data representations (RDD, DataFrame) in Spark? Please explain, and provide references if possible. No code answer is required.

visitors_path = "/interview-datasets/sa/births/births-with-visitor-data.json"

## Hint: the code below will read in the downloaded JSON files. However, the xml column needs to be given structure. Consider using a UDF.

# Read the births-with-visitor-data.json file into a dataframe, parse the nested XML fields into columns and print the total record count

df = spark.read.option("inferSchema", True).json(visitors_path)

import xml.etree.ElementTree as ET
import pyspark.sql.functions as F

# UDF that parses the visitors XML string into an array of (id, age, sex) structs.
@F.udf('array<struct<id:string, age:string, sex:string>>')
def parse_visitors(s):
    root = ET.fromstring(s)
    return list(map(lambda x: x.attrib, root.findall('visitor')))

# Explode the parsed visitors so there is one row per visitor per birth.
df.select("year", "county", "first_name", F.explode(parse_visitors("visitors")).alias("visitors")) \
    .select("year", "county", "first_name", "visitors.*") \
    .createOrReplaceTempView("babyVisitors")

display(spark.sql("select count(*) from babyVisitors"))

bVisitors = spark.sql("select * from babyVisitors")
bVisitors.show(1)
bVisitors.printSchema()

## Hint: check for inconsistently capitalized field values. It will make your answer incorrect.

# Find the county with the highest average number of visitors across all births in that county

# Count visitors per birth from the original dataframe, normalise the county casing, then
# average the per-birth visitor counts by county and keep the county with the highest average.
perBirth = df.select(F.upper("county").alias("county"), F.size(parse_visitors("visitors")).alias("numVisitors"))

highAvgVisitors = perBirth.groupBy("county").agg(F.avg("numVisitors").alias("avgVis")) \
    .select(F.max(F.struct("avgVis", "county")).alias("v")).select("v.*")

display(highAvgVisitors)

## Hint: check for inconsistently capitalized field values. It will make your answer incorrect.

# Find the average visitor age for a birth in the county of KINGS

# Normalise the county casing before filtering so rows such as 'Kings' are not missed.
bVisitors.filter(F.upper(F.col("county")) == 'KINGS') \
    .groupBy(F.upper("county").alias("county")) \
    .agg(F.avg("age").alias("avgAge")).show()
## Hint: check for inconsistently capitalized field values. It will make your answer incorrect.

# Find the most common birth visitor age in the county of KINGS

# When ignoring the inconsistently capitalized county field
mostComAge = spark.sql("""
    SELECT county, age, count(*) AS comAge
    FROM babyVisitors
    WHERE county = 'KINGS'
    GROUP BY county, age
    ORDER BY comAge DESC
    LIMIT 1
""")
display(mostComAge)

# When fixing the inconsistently capitalized county field by converting it to UPPER case as the standard
mostComAge = spark.sql("""
    SELECT UPPER(county) AS county, age, count(*) AS comAge
    FROM babyVisitors
    WHERE UPPER(county) = 'KINGS'
    GROUP BY UPPER(county), age
    ORDER BY comAge DESC
    LIMIT 1
""")
display(mostComAge)

#4 - Written Answer

For this baby visitor dataset I apply the logic below to find the most common visitor age in the county "KINGS" using Spark SQL:

a. Count records by age, grouped by county, applying the UPPER function to the county so that inconsistently capitalized values are not split across groups.
b. Order the result set in descending order of that count and pick the first row.
c. Display the final result from the dataframe.
