Big Data Quality Assurance (Manual) - Interview Questionnaire v1.0
Interview Questionnaire
Exchange Rates
Expected Query:
customer_data.csv
customer_id,name,age,email,created_at
1,John Doe,30,[email protected],2023-01-15
2,Jane Smith,,[email protected],2023-02-20
3,Bob Johnson,45,,2023-03-10
4,Alice Brown,29,[email protected],
5,Charlie,40,charlie123@sample,2023-04-01
6,Emily White,27,[email protected],2023-04-10
7,David Black,,-,2023-05-05
8,Olivia Green,33,,2023-06-15
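The sample file above contains deliberate quality defects: missing age values (rows 2 and 7), missing or placeholder emails (rows 3, 7 and 8), a malformed email address (row 5), and a missing created_at date (row 4). As a reference only, a minimal Python sketch that flags these issues could look like the following; the file name and the email pattern are assumptions, not part of the original exercise.

import csv
import re

# Simple data-quality scan for customer_data.csv (file name assumed from the sample above).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

with open("customer_data.csv", newline="") as f:
    for row in csv.DictReader(f):
        issues = []
        if not row["age"].strip():
            issues.append("missing age")
        email = row["email"].strip()
        if not email or email == "-":
            issues.append("missing email")
        elif not EMAIL_PATTERN.match(email):
            issues.append("invalid email format")
        if not row["created_at"].strip():
            issues.append("missing created_at")
        if issues:
            print(f"customer_id={row['customer_id']}: {', '.join(issues)}")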
etl_source.csv
This is the raw input data before transformation.
id,name,value,status,created_date
1,Alpha,100,Active,2024-03-01
2,Beta,200,Inactive,2024-03-02
3,Gamma,300,Active,2024-03-03
4,Delta,400,Active,2024-03-04
5,Epsilon,500,Inactive,2024-03-05
etl_target.csv
This is the expected output after transformation.
Transformations applied:
1. Filtered out inactive records – Only rows with status=Active are included.
2. Added a new column updated_value – updated_value = value * 1.1 (10% increment).
3. Renamed column created_date → processed_date – for consistency in naming.
id,name,value,updated_value,processed_date
1,Alpha,100,110.0,2024-03-01
3,Gamma,300,330.0,2024-03-03
4,Delta,400,440.0,2024-03-04
· Write a Python script that simulates the validation of an ETL pipeline’s output by
comparing etl_source.csv and etl_target.csv.
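One possible shape for such a script, assuming plain Python with the csv module (pandas or PySpark would work equally well), is sketched below. It re-applies the three transformations to etl_source.csv and compares the result row by row against etl_target.csv; the exact expected solution may differ.

import csv

# Re-derive the expected target from the source and compare it with etl_target.csv.
with open("etl_source.csv", newline="") as f:
    source = list(csv.DictReader(f))

expected = [
    {
        "id": r["id"],
        "name": r["name"],
        "value": r["value"],
        "updated_value": f"{float(r['value']) * 1.1:.1f}",  # 10% increment
        "processed_date": r["created_date"],                 # renamed column
    }
    for r in source
    if r["status"] == "Active"                               # drop inactive rows
]

with open("etl_target.csv", newline="") as f:
    target = list(csv.DictReader(f))

assert expected == target, f"ETL output mismatch:\nexpected={expected}\nactual={target}"
print("ETL validation passed: target matches the transformed source.")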
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFilter").getOrCreate()
df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 35), (3, "Charlie", 40)],
                           ["id", "name", "age"])
How would you write a PySpark script to check whether two DataFrames, df1 and
df2, are identical? The candidate may optionally write the comparison in Spark SQL
first and then translate it back into PySpark code.
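One common approach, shown here as a sketch rather than the only acceptable answer, compares the schemas and then uses exceptAll in both directions so that duplicate rows are accounted for; df1 and df2 are assumed to be already-defined DataFrames.

def dataframes_identical(df1, df2):
    # Schemas must match (column names and types).
    if df1.schema != df2.schema:
        return False
    # exceptAll keeps duplicate rows, so both differences being empty
    # means the two DataFrames contain exactly the same rows.
    return df1.exceptAll(df2).count() == 0 and df2.exceptAll(df1).count() == 0

print(dataframes_identical(df, df))  # True: a DataFrame always equals itself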
Expected Code:
# Sample DataFrames
customers_data = [
(1, "Alice", "USA"),
(2, "Bob", "UK"),
(3, "Charlie", "India"),
(4, "David", "USA")
]
transactions_data = [
(101, 1, 500, "2024-03-01"),
(102, 2, 300, "2024-03-02"),
(103, 1, 700, "2024-03-03"),
(104, 3, 400, "2024-03-03"),
(105, 2, 600, "2024-03-04"),
(106, 1, 200, "2024-03-05")
]
# Create DataFrames
customers_df = spark.createDataFrame(customers_data, ["customer_id", "name", "country"])
transactions_df = spark.createDataFrame(transactions_data, ["txn_id", "customer_id", "amount", "txn_date"])
# Show Results
ranked_df.show()
Code Description: The code joins two datasets (customers and transactions),
calculates the total amount spent by each customer, ranks them based on
spending, and displays the results.
Given the above PySpark code, write the following SQL queries and
corresponding test cases for validation:
(A) SQL Queries
1. Write an SQL query to perform the same JOIN operation between the customers and transactions tables.
2. Write an SQL query to calculate the total amount spent by each customer.
3. Write an SQL query to rank customers based on total spending, using window functions.
* See the section ‘Solution Hints (Private to Interviewer)’ for the expected outputs.
def test_customer_id_sorted():
    records = [
        {"customer_id": 102, "name": "Alice"},
        {"customer_id": 201, "name": "Bob"},
        {"customer_id": 305, "name": "Charlie"}
    ]
    sorted_records = sorted(records, key=lambda x: x["customer_id"])
    assert records == sorted_records, "Records are not sorted by customer_id"
-- 1. Perform Join
SELECT c.customer_id, c.name, c.country, t.txn_id, t.amount, t.txn_date
FROM transactions t
JOIN customers c ON t.customer_id = c.customer_id;
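Queries 2 and 3 from part (A) are not reproduced in this copy of the document. A plausible Spark SQL sketch, assuming the tables are registered as customers and transactions, is shown below; the exact expected solution may differ.

-- 2. Total amount spent by each customer
SELECT c.customer_id, c.name, SUM(t.amount) AS total_amount
FROM transactions t
JOIN customers c ON t.customer_id = c.customer_id
GROUP BY c.customer_id, c.name;

-- 3. Rank customers by total spending using a window function
SELECT customer_id, name, total_amount,
       RANK() OVER (ORDER BY total_amount DESC) AS spending_rank
FROM (
    SELECT c.customer_id, c.name, SUM(t.amount) AS total_amount
    FROM transactions t
    JOIN customers c ON t.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
) totals;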
The transformed dataset should not contain any NULL or negative values in the
amount_usd column.
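A straightforward way to assert this rule, sketched here with PySpark and assuming the transformed data is available as a DataFrame named transformed_df with an amount_usd column (both names are assumptions), is:

from pyspark.sql import functions as F

# Count rows that violate the rule: amount_usd is NULL or negative.
bad_rows = transformed_df.filter(
    F.col("amount_usd").isNull() | (F.col("amount_usd") < 0)
).count()

assert bad_rows == 0, f"{bad_rows} rows have NULL or negative amount_usd"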