Data Engineering - Solutions

This document provides solutions to 6 problems focused on data engineering with Spark SQL and Hadoop. Each problem loads data from source files, performs data transformations, and saves the results to destination files using different file formats like Parquet, JSON, ORC, text, and tables. The solutions are checked by loading the saved data and verifying record counts. Additional exam prep problems are recommended for further practice with data engineering concepts.
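Each solution below follows the same read, transform, write, verify pattern. As a minimal sketch of that pattern (assuming a running spark-shell session; the paths and column name here are hypothetical placeholders, not part of the course data):

```scala
// Generic pattern used throughout: read, transform, write, then re-read to verify.
// The paths and column name below are illustrative only.
val df = spark.read.format("csv")
  .option("header", "true")
  .load("/user/example/input.csv")

df.filter("some_column IS NOT NULL")
  .write.format("parquet")
  .option("compression", "gzip")
  .mode("overwrite")
  .save("/user/example/output/")

// Verify by loading the written files and checking the record count.
val check = spark.read.format("parquet").load("/user/example/output/")
check.count()
```

The same skeleton recurs in every problem; only the formats, options, and transformations change.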


Spark SQL & Hadoop Course
(For Data Scientists & Big Data Analysts)

Data Engineering
Solutions to Problems

PROBLEM 1 - SOLUTION

/// Read in data & create a DataFrame

val df_q1 = spark.read.format("csv")
  .option("header", "true")
  .load("/user/verulam_blue/data/WHO_data/population_data.csv.bz2")

/// Write code & save results

val col_names = df_q1.columns

df_q1
  .na.fill("NIL", col_names)
  .write.format("avro")
  .option("compression", "snappy")
  .mode("overwrite")
  .save("/user/vb_student/problems/section07/problem01/")

/// Check Results

// Avro has no "header" option, and the codec is read from the file
// metadata, so no read options are needed here
val check_1 = spark.read.format("avro")
  .load("/user/vb_student/problems/section07/problem01/")

check_1.show(3, false)
check_1.count()

Should be: 8665

Data Engineering - Solutions to Problems, Page 1 of 7


PROBLEM 2 - SOLUTION

/// Read in data & create a DataFrame

// Parquet stores its codec in the file footer, so no compression
// option is needed on read
val df_q2 = spark.read.format("parquet")
  .load("/user/verulam_blue/data/gp_db/practice_demographics/")

/// Write code & save results

df_q2
  .filter("nbr_of_patients > 3000 AND nbr_of_patients < 4000")
  .selectExpr("practice_code", "nbr_of_patients")
  .write.format("json")
  .option("compression", "deflate")
  .mode("overwrite")
  .save("/user/vb_student/problems/section07/problem02")

/// Check Results

// Compression is detected from the file extension on read
val check_2 = spark.read.format("json")
  .load("/user/vb_student/problems/section07/problem02")

check_2.show(3, false)
check_2.count()

Should be: 806



PROBLEM 3 - SOLUTION

/// Read in data & create a DataFrame

val df_q3 = spark.read.format("parquet")
  .load("/user/verulam_blue/data/credit_cards")

/// Write code & save results

df_q3
  .select("card_holder_name", "issuing_bank", "issue_date")
  .write.format("orc")
  .option("compression", "zlib")
  .mode("overwrite")
  .save("/user/vb_student/problems/section07/problem03")

/// Check Results

val check_3 = spark.read.format("orc")
  .load("/user/vb_student/problems/section07/problem03/")

check_3.show(3, false)
check_3.count()

Should be: 2000000



PROBLEM 4 - SOLUTION

/// Read in data & create a DataFrame

// Parquet has no "header" option and needs no compression option on read
val df_q4 = spark.read.format("parquet")
  .load("/user/verulam_blue/data/gp_db/gp_rx")

/// Write code & save results

df_q4
  .selectExpr("concat_ws('\t', sha, pct, practice_code, bnf_code, bnf_name, items, nic, act_cost, quantity, period) as results")
  .write.format("text")
  .option("compression", "lz4")
  .mode("overwrite")
  .save("/user/vb_student/problems/section07/problem04")

/// Check Results

val check_4 = spark.read.format("csv")
  .option("sep", "\t")
  .load("/user/vb_student/problems/section07/problem04/")

check_4.show(3, false)
check_4.count()

Should be: 10272116



PROBLEM 5 - SOLUTION

/// Read in data & create a DataFrame

val df_q5 = spark.read.format("parquet")
  .load("/user/verulam_blue/data/taxi_data")

/// Write code & save results

df_q5
  .where(month($"pickup_datetime") === 3)  // month() returns an integer
  .selectExpr("concat_ws('|', *) as results")
  .write.format("text")
  .option("compression", "gzip")
  .mode("overwrite")
  .save("/user/vb_student/problems/section07/problem05")

/// Check Results

val check_5 = spark.read.format("csv")
  .option("sep", "|")
  .load("/user/vb_student/problems/section07/problem05/")

check_5.show(3, false)
check_5.count()

Should be: 834429



PROBLEM 6 - SOLUTION

/// Read in data & create a DataFrame

val df_q6 = spark.read.format("parquet")
  .load("/user/verulam_blue/data/gp_db/gp_rx/")

/// Write code & save results

df_q6
  .selectExpr("practice_code", "bnf_code", "bnf_name", "items", "nic", "act_cost",
    "abs(nic - act_cost) as difference")
  .where($"difference" > 2)
  .drop("difference")
  .coalesce(1)
  .write.format("parquet")
  .option("compression", "gzip")
  .option("path", "/user/vb_student/problems/section07/problem06/")
  .mode("append")
  .saveAsTable("gp_db.q6_soln")

/// Check Results

val check_6 = spark.sql("SELECT * FROM gp_db.q6_soln")

check_6.show(3, false)
check_6.count()

Should be: 4736749
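Because Problem 6 registers a table in the metastore rather than writing plain files, you can also confirm the table is visible in the catalog before querying it. This is an optional extra check, not part of the original solution:

```scala
// Optional sanity check: confirm gp_db.q6_soln is registered in the
// catalog before running SELECT against it (table name from the
// solution above; tableExists accepts a database-qualified name).
if (spark.catalog.tableExists("gp_db.q6_soln")) {
  println("Table gp_db.q6_soln exists")
} else {
  println("Table not found; re-run the write step")
}
```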

End of Solutions to Problems


For more problems that focus on the “Data Engineering” section of this course, see the course:

CCA175 Exam Prep Questions Part A ETL Focus (With Spark 2.4 Hadoop Cluster VM)

See the Bonus Section for more details.

