Data Engineering - Solutions

This document provides solutions to 6 problems focused on data engineering with Spark SQL and Hadoop. Each problem loads data from source files, performs data transformations, and saves the results to destination files using different file formats like Parquet, JSON, ORC, text, and tables. The solutions are checked by loading the saved data and verifying record counts. Additional exam prep problems are recommended for further practice with data engineering concepts.
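Each solution below follows the same read, transform, write, verify pattern. As a minimal sketch of that pattern (assuming a running spark-shell session; the paths and column name here are hypothetical placeholders, not part of the course data):

```scala
// Generic pattern used throughout: read, transform, write, then re-read to verify.
// The paths and column name below are illustrative only.
val df = spark.read.format("csv")
  .option("header", "true")
  .load("/user/example/input.csv")

df.filter("some_column IS NOT NULL")
  .write.format("parquet")
  .option("compression", "gzip")
  .mode("overwrite")
  .save("/user/example/output/")

// Verify by loading the written files and checking the record count.
val check = spark.read.format("parquet").load("/user/example/output/")
check.count()
```

The same skeleton recurs in every problem; only the formats, options, and transformations change.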


Spark SQL & Hadoop Course
(For Data Scientists & Big Data Analysts)

Data Engineering
Solutions to Problems

PROBLEM 1 - SOLUTION

/// Read in data & create a DataFrame

val df_q1 = spark.read.format("csv")
  .option("header", "true")
  .load("/user/verulam_blue/data/WHO_data/population_data.csv.bz2")

/// Write code & save results

val col_names = df_q1.columns

df_q1
  .na.fill("NIL", col_names)
  .write.format("avro")
  .option("compression", "snappy")
  .mode("overwrite")
  .save("/user/vb_student/problems/section07/problem01/")

/// Check Results

// Avro has no "header" option, and the codec is read from the file
// metadata, so no read options are needed here
val check_1 = spark.read.format("avro")
  .load("/user/vb_student/problems/section07/problem01/")

check_1.show(3, false)
check_1.count()

Should be: 8665

Data Engineering - Solutions to Problems, Page 1 of 7


PROBLEM 2 - SOLUTION

/// Read in data & create a DataFrame

// Parquet stores its codec in the file footer, so no compression
// option is needed on read
val df_q2 = spark.read.format("parquet")
  .load("/user/verulam_blue/data/gp_db/practice_demographics/")

/// Write code & save results

df_q2
  .filter("nbr_of_patients > 3000 AND nbr_of_patients < 4000")
  .selectExpr("practice_code", "nbr_of_patients")
  .write.format("json")
  .option("compression", "deflate")
  .mode("overwrite")
  .save("/user/vb_student/problems/section07/problem02")

/// Check Results

// Compression is detected from the file extension on read
val check_2 = spark.read.format("json")
  .load("/user/vb_student/problems/section07/problem02")

check_2.show(3, false)
check_2.count()

Should be: 806



PROBLEM 3 - SOLUTION

/// Read in data & create a DataFrame

val df_q3 = spark.read.format("parquet")
  .load("/user/verulam_blue/data/credit_cards")

/// Write code & save results

df_q3
  .select("card_holder_name", "issuing_bank", "issue_date")
  .write.format("orc")
  .option("compression", "zlib")
  .mode("overwrite")
  .save("/user/vb_student/problems/section07/problem03")

/// Check Results

val check_3 = spark.read.format("orc")
  .load("/user/vb_student/problems/section07/problem03/")

check_3.show(3, false)
check_3.count()

Should be: 2000000



PROBLEM 4 - SOLUTION

/// Read in data & create a DataFrame

// Parquet has no "header" option and needs no compression option on read
val df_q4 = spark.read.format("parquet")
  .load("/user/verulam_blue/data/gp_db/gp_rx")

/// Write code & save results

df_q4
  .selectExpr("concat_ws('\t', sha, pct, practice_code, bnf_code, bnf_name, items, nic, act_cost, quantity, period) as results")
  .write.format("text")
  .option("compression", "lz4")
  .mode("overwrite")
  .save("/user/vb_student/problems/section07/problem04")

/// Check Results

val check_4 = spark.read.format("csv")
  .option("sep", "\t")
  .load("/user/vb_student/problems/section07/problem04/")

check_4.show(3, false)
check_4.count()

Should be: 10272116



PROBLEM 5 - SOLUTION

/// Read in data & create a DataFrame

val df_q5 = spark.read.format("parquet")
  .load("/user/verulam_blue/data/taxi_data")

/// Write code & save results

df_q5
  .where(month($"pickup_datetime") === 3)  // month() returns an integer
  .selectExpr("concat_ws('|', *) as results")
  .write.format("text")
  .option("compression", "gzip")
  .mode("overwrite")
  .save("/user/vb_student/problems/section07/problem05")

/// Check Results

val check_5 = spark.read.format("csv")
  .option("sep", "|")
  .load("/user/vb_student/problems/section07/problem05/")

check_5.show(3, false)
check_5.count()

Should be: 834429



PROBLEM 6 - SOLUTION

/// Read in data & create a DataFrame

val df_q6 = spark.read.format("parquet")
  .load("/user/verulam_blue/data/gp_db/gp_rx/")

/// Write code & save results

df_q6
  .selectExpr("practice_code", "bnf_code", "bnf_name", "items", "nic", "act_cost",
    "abs(nic - act_cost) as difference")
  .where($"difference" > 2)
  .drop("difference")
  .coalesce(1)
  .write.format("parquet")
  .option("compression", "gzip")
  .option("path", "/user/vb_student/problems/section07/problem06/")
  .mode("append")
  .saveAsTable("gp_db.q6_soln")

/// Check Results

val check_6 = spark.sql("SELECT * FROM gp_db.q6_soln")

check_6.show(3, false)
check_6.count()

Should be: 4736749
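Because Problem 6 registers a table in the metastore rather than writing plain files, you can also confirm the table is visible in the catalog before querying it. This is an optional extra check, not part of the original solution:

```scala
// Optional sanity check: confirm gp_db.q6_soln is registered in the
// catalog before running SELECT against it (table name from the
// solution above; tableExists accepts a database-qualified name).
if (spark.catalog.tableExists("gp_db.q6_soln")) {
  println("Table gp_db.q6_soln exists")
} else {
  println("Table not found; re-run the write step")
}
```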

End of Solutions to Problems


For more problems that focus on the “Data Engineering” section of this course, see the course:

CCA175 Exam Prep Questions Part A ETL Focus (With Spark 2.4 Hadoop Cluster VM)

See the Bonus Section for more details.

