
Big Data Quality Assurance (Manual) - Interview Questionnaire

Duration of the interview: 90 minutes


Notes for interviewer:
 This interview can be conducted without access to Azure data tools
 The candidate can write code in any plain text editor
 The following datasets should be shared with the candidate in the interview
meeting chat

SQL - Data Validation Queries (Section 1)


 Write a query to find duplicate customer records within each daily partition.
Expected Query:

SELECT customer_id, date_partition, COUNT(*)
FROM customers_table
GROUP BY customer_id, date_partition
HAVING COUNT(*) > 1;

 You have a table transactions (id, customer_id, amount, transaction_date).
Write a query to find the total amount spent by each customer in the last
30 days.
Expected Query:

SELECT customer_id, SUM(amount) AS total_spent
FROM transactions
WHERE transaction_date >= NOW() - INTERVAL '30 days'
GROUP BY customer_id;

 Find the sum of all transactions in USD per customer on 2/2/2022

Exchange Rates

| Currency | USD_Rate | EFFECTIVE_START_DATE | EFFECTIVE_END_DATE |
| PKR      | 300      | 1/1/2022 0:00        | NULL               |
| EUR      | 1.1      | 1/1/2022 0:00        | NULL               |
| GBP      | 1.1      | 1/1/2022 0:00        | 8/16/2022 0:00     |
| GBP2     | 0.9      | 8/16/2022 0:00       | NULL               |

Transactions

| transaction_id | customer_id | transaction_date | currency | amount |
| 125            | 1001        | 7/25/2022 0:00   | USD      | 250    |
| 126            | 1001        | 7/25/2022 0:00   | EUR      | 150    |
| 127            | 1002        | 7/26/2022 0:00   | PKR      | 200    |
| 128            | 1001        | 7/29/2022 0:00   | PKR      | 300    |
| 129            | 1001        | 7/30/2022 0:00   | GBP      | 400    |
| 129            | 1001        | 7/30/2022 0:03   | GBP      | 400    |

Expected Query:

SELECT t.customer_id, SUM(t.amount * er.USD_Rate) AS total_spent_in_usd
FROM transactions t
INNER JOIN exchange_rates er ON t.currency = er.Currency
WHERE t.transaction_date >= er.EFFECTIVE_START_DATE
  AND (t.transaction_date <= er.EFFECTIVE_END_DATE OR er.EFFECTIVE_END_DATE IS NULL)
GROUP BY t.customer_id;
PySpark/Python for Data Validation (Section 2)
· Given a file path and filename pattern, write a Python script to check if a file exists in
a local directory and validate its contents.

customer_data.csv

customer_id,name,age,email,created_at

1,John Doe,30,[email protected],2023-01-15

2,Jane Smith,,[email protected],2023-02-20

3,Bob Johnson,45,,2023-03-10

4,Alice Brown,29,[email protected],

5,Charlie,40,charlie123@sample,2023-04-01

6,Emily White,27,[email protected],2023-04-10

7,David Black,,-,2023-05-05

8,Olivia Green,33,,2023-06-15
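
A minimal Python sketch of the kind of answer expected for the file-existence check above, assuming the file sits in a local ./data directory and that customer_id, name, and email are treated as mandatory fields (both assumptions are illustrative, not part of the question):

import csv
import glob
import os

def find_files(directory, pattern):
    # Return all paths in the directory that match the filename pattern
    return glob.glob(os.path.join(directory, pattern))

def validate_contents(path, mandatory_fields=("customer_id", "name", "email")):
    # Collect simple row-level issues: missing mandatory values, non-numeric age
    issues = []
    with open(path, newline="") as f:
        for line_no, row in enumerate(csv.DictReader(f), start=2):  # header is line 1
            for field in mandatory_fields:
                if not (row.get(field) or "").strip():
                    issues.append(f"line {line_no}: missing {field}")
            age = (row.get("age") or "").strip()
            if age and not age.isdigit():
                issues.append(f"line {line_no}: non-numeric age '{age}'")
    return issues

matches = find_files("./data", "customer_data*.csv")
if not matches:
    print("File not found")
else:
    for problem in validate_contents(matches[0]):
        print(problem)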

· Given customer_data.csv, write a PySpark script to clean and validate the dataset
stored as a CSV file in a local directory.
· Implement a data quality validation script using Python/PySpark, ensuring all records
in customer_data.csv conform to a given schema.
· How would you write a PySpark script to filter out invalid records? (A sample sketch
covering these items follows this list.)
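
A hedged PySpark sketch covering the three items above (cleaning, schema validation, and filtering invalid records); the nullability flags and the email regex are assumptions chosen for illustration, not requirements from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

spark = SparkSession.builder.appName("CustomerValidation").getOrCreate()

# Assumed target schema for customer_data.csv
schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), True),
    StructField("email", StringType(), True),
    StructField("created_at", DateType(), True),
])

df = spark.read.csv("customer_data.csv", header=True, schema=schema)

# Keep records with a key, a name, a plausible age, and a well-formed email;
# everything else is treated as invalid
valid_df = df.filter(
    col("customer_id").isNotNull()
    & col("name").isNotNull()
    & col("age").between(0, 120)
    & col("email").rlike(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
)
invalid_df = df.subtract(valid_df)

valid_df.show()
invalid_df.show()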

etl_source.csv
This is the raw input data before transformation.

id,name,value,status,created_date
1,Alpha,100,Active,2024-03-01
2,Beta,200,Inactive,2024-03-02
3,Gamma,300,Active,2024-03-03
4,Delta,400,Active,2024-03-04
5,Epsilon,500,Inactive,2024-03-05

etl_target.csv
This is the expected output after transformation.
Transformations applied:
1. Filtered out inactive records – Only rows with status=Active are included.
2. Added a new column updated_value – updated_value = value * 1.1 (10%
increment).
3. Renamed column created_date → processed_date – for consistency in naming.

id,name,value,updated_value,processed_date
1,Alpha,100,110.0,2024-03-01
3,Gamma,300,330.0,2024-03-03
4,Delta,400,440.0,2024-03-04

· Write a Python script that simulates the validation of an ETL pipeline’s output by
comparing etl_source.csv and etl_target.csv.

Expected checks to be performed by the candidate (a sample comparison sketch follows the list):

1. Missing Values Check → Ensures mandatory fields are not NULL.
2. Uniqueness Check → Prevents duplicate primary keys.
3. Value Constraints → Ensures values are non-negative and numeric.
4. Data Type Validation → Confirms correct schema before loading.
5. Character Validation → Ensures proper formatting of textual fields.
6. Row Count Check → Ensures no unintended data loss during transformation.
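
A minimal Python sketch of the source-vs-target comparison, assuming pandas is available and both CSVs sit in the working directory; the checks mirror the list above:

import pandas as pd

source = pd.read_csv("etl_source.csv")
target = pd.read_csv("etl_target.csv")

# Recompute the expected target from the source using the documented transformation rules
expected = source[source["status"] == "Active"].copy()
expected["updated_value"] = expected["value"] * 1.1
expected = expected.rename(columns={"created_date": "processed_date"})
expected = expected[["id", "name", "value", "updated_value", "processed_date"]]

checks = {
    "Row Count Check": len(target) == len(expected),
    "Uniqueness Check (primary key)": target["id"].is_unique,
    "Missing Values Check": not target[["id", "name", "value", "updated_value"]].isnull().any().any(),
    "Value Constraints (non-negative)": (target["value"] >= 0).all() and (target["updated_value"] >= 0).all(),
    "Transformation Output Check": expected.reset_index(drop=True).equals(target.reset_index(drop=True)),
}

for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")

Data type and character validation could be added in the same style by asserting on target.dtypes and applying a regex to the textual columns.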

PySpark for Data Processing (Section 3)


 Given a PySpark DataFrame df with columns ["id", "name", "age"], write a
PySpark script to filter records where age > 30 and display results.
Expected Code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFilter").getOrCreate()
df = spark.createDataFrame(
    [(1, "Alice", 25), (2, "Bob", 35), (3, "Charlie", 40)],
    ["id", "name", "age"],
)

filtered_df = df.filter(df.age > 30)
filtered_df.show()

 How would you write a PySpark script to check if two DataFrames df1 and
df2 are identical? (Optionally, the candidate may write the comparison in Spark SQL
and then convert it back to PySpark code.)
Expected Code:

if df1.exceptAll(df2).count() == 0 and df2.exceptAll(df1).count() == 0:
    print("DataFrames are identical")
else:
    print("DataFrames are different")
 How would you test data deduplication logic in a Spark job?
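
One way to probe this question, sketched below under the assumption that the deduplication logic is a dropDuplicates on a (customer_id, txn_date) business key: seed the job with known duplicates and assert both the expected count and the absence of repeated keys.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("DedupTest").getOrCreate()

# Seed data containing a known duplicate on the (customer_id, txn_date) business key
rows = [
    (1, "2024-03-01", 100),
    (1, "2024-03-01", 100),   # duplicate row
    (2, "2024-03-02", 200),
]
df = spark.createDataFrame(rows, ["customer_id", "txn_date", "amount"])

deduped = df.dropDuplicates(["customer_id", "txn_date"])

# Assertions a tester might write
assert deduped.count() == 2, "duplicate row was not removed"
assert deduped.groupBy("customer_id", "txn_date").count().filter(col("count") > 1).count() == 0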

 Provide a piece of code with Joins, Window Functions, and Aggregation,
and ask the candidate to write SQL queries and test cases for validation.

Sample Code for Candidate:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, rank
from pyspark.sql.window import Window

# Initialize Spark Session
spark = SparkSession.builder.appName("BigDataTesting").getOrCreate()

# Sample DataFrames
customers_data = [
    (1, "Alice", "USA"),
    (2, "Bob", "UK"),
    (3, "Charlie", "India"),
    (4, "David", "USA")
]
transactions_data = [
    (101, 1, 500, "2024-03-01"),
    (102, 2, 300, "2024-03-02"),
    (103, 1, 700, "2024-03-03"),
    (104, 3, 400, "2024-03-03"),
    (105, 2, 600, "2024-03-04"),
    (106, 1, 200, "2024-03-05")
]

# Create DataFrames
customers_df = spark.createDataFrame(customers_data, ["customer_id", "name", "country"])
transactions_df = spark.createDataFrame(transactions_data, ["txn_id", "customer_id", "amount", "txn_date"])

# Join Customers with Transactions
joined_df = transactions_df.join(customers_df, "customer_id", "inner")

# Aggregation: Total Amount Spent by Each Customer
aggregated_df = joined_df.groupBy("customer_id", "name").agg(sum("amount").alias("total_spent"))

# Window Function: Ranking Customers by Spending
window_spec = Window.orderBy(col("total_spent").desc())
ranked_df = aggregated_df.withColumn("rank", rank().over(window_spec))

# Show Results
ranked_df.show()

Code Description: The code joins two datasets (customers and transactions),
calculates the total amount spent by each customer, ranks them based on
spending, and displays the results.
Given the above PySpark code, write the following SQL queries and
corresponding test cases for validation:
(A) SQL Queries
 Write an SQL query to perform the same JOIN operation between the
customers and transactions tables.
 Write an SQL query to calculate the total amount spent by each customer.
 Write an SQL query to rank customers based on total spending, using
window functions.
* See the ‘Solution Hints (Private to Interviewer)’ section for the expected
outputs.

(B) Test Cases for Validation

The candidate should provide test cases to validate the above queries (a sample
assertion sketch follows the table).

| Test Case                       | Input Data                                        | Expected Output                                                        |
| Test Join Output                | customers_df and transactions_df as input         | All transactions should have a matching customer_id from customers_df |
| Test Total Spending Calculation | Customer 1 has transactions: (500, 700, 200)      | total_spent for customer 1 should be 1400                              |
| Test Customer Ranking           | Customer 1 spends 1400 and Customer 2 spends 900  | Customer 1 should have rank 1, Customer 2 rank 2                       |
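
A sample pytest-style assertion for the 'Test Total Spending Calculation' case above (a sketch of one acceptable answer, not a required format):

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum

def test_total_spending_for_customer_1():
    spark = SparkSession.builder.appName("SpendingTest").getOrCreate()
    transactions_df = spark.createDataFrame(
        [(101, 1, 500), (103, 1, 700), (106, 1, 200)],
        ["txn_id", "customer_id", "amount"],
    )
    totals = transactions_df.groupBy("customer_id").agg(spark_sum("amount").alias("total_spent"))
    total_for_1 = totals.filter("customer_id = 1").collect()[0]["total_spent"]
    assert total_for_1 == 1400, "total_spent for customer 1 should be 1400"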

 How do you test the correctness of a Spark transformation involving
multiple joins?
 What techniques do you use to validate performance in a Spark job?
 How would you test and debug failures in Spark jobs running on
Databricks?
 How do you validate a Spark job's output against expected business logic?
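
For the last question, one common approach is to encode the business rule as a small expected DataFrame and diff it against the job's output with exceptAll, so any deviation surfaces as concrete rows. A sketch reusing spark and aggregated_df from the sample code earlier in this section (the expected totals below follow from that sample data):

expected_df = spark.createDataFrame(
    [(1, "Alice", 1400), (2, "Bob", 900), (3, "Charlie", 400)],
    ["customer_id", "name", "total_spent"],
)

unexpected_rows = aggregated_df.exceptAll(expected_df)   # rows produced but not expected
missing_rows = expected_df.exceptAll(aggregated_df)      # rows expected but not produced

assert unexpected_rows.count() == 0 and missing_rows.count() == 0, \
    "job output deviates from the expected business logic"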

Azure Data Factory (ADF) and Azure Data Lake Storage (ADLS) (Section 4):
 What are the different testing scenarios for pipelines in ADF?
 How do you test parameterized pipelines in ADF?
 What strategies do you use to test ADF pipeline dependencies and trigger
mechanisms?
 What are the common challenges in ADF testing, and how do you
overcome them?
 How would you automate data validation and quality checks on files stored
in ADLS?
 What strategies do you use to monitor data quality and performance in
ADLS as part of an ETL pipeline?
Testing Scenarios & Problem-Solving
(Section 5)
 You are tasked with validating a new transformation logic that involves
joining two large datasets. How do you approach testing?
 If a transformation logic fails in Databricks, how do you identify the issue?
 How do you approach testing incremental data loads vs. full loads?
 What are the different SCD (Slowly Changing Dimension) types in ETL?
 What types of negative testing do you perform in Data Lake and ETL
testing?
 What is boundary value analysis, and what other techniques do you use to design
test cases? (See the sketch after this list.)
 How do you test failure scenarios in an ETL pipeline?
 What tools or frameworks do you use for data validation and automated
testing in a Data Lake or Data Warehouse environment?
 How do you test the performance of Spark jobs?
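
For the boundary value analysis item above, a hedged illustration: given a hypothetical rule that a transaction amount must lie between 0.01 and 10,000, the candidate would be expected to pick values at and just around each boundary (both the rule and the pytest runner are assumptions for the example):

import pytest  # any test runner would do; pytest is assumed here

# Hypothetical business rule: 0.01 <= amount <= 10000
def is_valid_amount(amount):
    return 0.01 <= amount <= 10000

# Boundary value cases: below, on, and just above each boundary
@pytest.mark.parametrize("amount, expected", [
    (0.00, False),      # below the lower boundary
    (0.01, True),       # on the lower boundary
    (0.02, True),       # just above the lower boundary
    (9999.99, True),    # just below the upper boundary
    (10000.00, True),   # on the upper boundary
    (10000.01, False),  # above the upper boundary
])
def test_amount_boundaries(amount, expected):
    assert is_valid_amount(amount) == expected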

Behavior-Driven Development (Section 6)


 Write a sample BDD test case for validating a data transformation process
in Gherkin syntax.
o Example Scenario: Validate that the ETL pipeline correctly converts
transaction amounts from different currencies to USD
 Describe how you would manage and track defects using Jira X-Ray or Jira
Zephyr without direct integration to the cloud.

Data Warehousing and CI/CD (Nice to Have) (Section 7):
 How do you test data partitioning, indexing, and query optimization in a
data warehouse environment?
 How do you automate the testing of data pipelines in a CI/CD
environment?
 What are some common challenges you might face when automating data
tests in a CI/CD pipeline, and how would you solve them?
 What is a data warehouse, and how does it differ from a traditional
database?
 What is Azure Synapse, and how is it used in data warehousing and big
data analytics?
 How would you test data quality in a data warehouse like Azure Synapse?
 How would you automate ETL testing in Azure Synapse using Azure
DevOps or other tools?
Solution Hints (Private to Interviewer):
Expected Test Assertion:

def test_customer_id_sorted():
    records = [
        {"customer_id": 102, "name": "Alice"},
        {"customer_id": 201, "name": "Bob"},
        {"customer_id": 305, "name": "Charlie"}
    ]
    sorted_records = sorted(records, key=lambda x: x["customer_id"])
    assert records == sorted_records, "Records are not sorted by customer_id"

Expected SQL Queries:

-- 1. Perform Join
SELECT c.customer_id, c.name, c.country, t.txn_id, t.amount, t.txn_date
FROM transactions t
JOIN customers c ON t.customer_id = c.customer_id;

-- 2. Calculate Total Amount Spent per Customer
SELECT c.customer_id, c.name, SUM(t.amount) AS total_spent
FROM transactions t
JOIN customers c ON t.customer_id = c.customer_id
GROUP BY c.customer_id, c.name;

-- 3. Rank Customers by Total Spending
SELECT c.customer_id, c.name, SUM(t.amount) AS total_spent,
       RANK() OVER (ORDER BY SUM(t.amount) DESC) AS spending_rank
FROM transactions t
JOIN customers c ON t.customer_id = c.customer_id
GROUP BY c.customer_id, c.name;

Sample BDD Test Case for Validating a Data Transformation Process in
Gherkin Syntax

Feature: Currency Conversion in ETL Pipeline

  Scenario: Validate that the ETL pipeline correctly converts transaction amounts from different currencies to USD
    Given the source dataset contains the following transactions:
      | transaction_id | customer_id | amount | currency | exchange_rate |
      | 1001           | 1           | 250    | EUR      | 1.1           |
      | 1002           | 2           | 300    | GBP      | 1.3           |
      | 1003           | 3           | 500    | USD      | 1.0           |
    When the ETL pipeline transforms the data
    Then the target dataset should contain the following transactions with amounts in USD:
      | transaction_id | customer_id | amount_usd |
      | 1001           | 1           | 275.00     |
      | 1002           | 2           | 390.00     |
      | 1003           | 3           | 500.00     |
    And the transformed dataset should not contain any NULL or negative values in the amount_usd column
