Big Data Quality Assurance (Manual) - Interview Questionnaire v1.0
Interview Questionnaire
Exchange Rates
Expected Query:
customer_data.csv
customer_id,name,age,email,created_at
1,John Doe,30,[email protected],2023-01-15
2,Jane Smith,,[email protected],2023-02-20
3,Bob Johnson,45,,2023-03-10
4,Alice Brown,29,[email protected],
5,Charlie,40,charlie123@sample,2023-04-01
6,Emily White,27,[email protected],2023-04-10
7,David Black,,-,2023-05-05
8,Olivia Green,33,,2023-06-15
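The sample file above contains deliberate quality defects: missing age values (rows 2 and 7), missing or placeholder emails (rows 3, 7 and 8), a malformed email address (row 5), and a missing created_at date (row 4). As a reference only, a minimal Python sketch that flags these issues could look like the following; the file name and the email pattern are assumptions, not part of the original exercise.

import csv
import re

# Simple data-quality scan for customer_data.csv (file name assumed from the sample above).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

with open("customer_data.csv", newline="") as f:
    for row in csv.DictReader(f):
        issues = []
        if not row["age"].strip():
            issues.append("missing age")
        email = row["email"].strip()
        if not email or email == "-":
            issues.append("missing email")
        elif not EMAIL_PATTERN.match(email):
            issues.append("invalid email format")
        if not row["created_at"].strip():
            issues.append("missing created_at")
        if issues:
            print(f"customer_id={row['customer_id']}: {', '.join(issues)}")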
etl_source.csv
This is the raw input data before transformation.
id,name,value,status,created_date
1,Alpha,100,Active,2024-03-01
2,Beta,200,Inactive,2024-03-02
3,Gamma,300,Active,2024-03-03
4,Delta,400,Active,2024-03-04
5,Epsilon,500,Inactive,2024-03-05
etl_target.csv
This is the expected output after transformation.
Transformations applied:
1. Filtered out inactive records – Only rows with status=Active are included.
2. Added a new column updated_value – updated_value = value * 1.1 (10% increment).
3. Renamed column created_date → processed_date – for consistency in naming.
id,name,value,updated_value,processed_date
1,Alpha,100,110.0,2024-03-01
3,Gamma,300,330.0,2024-03-03
4,Delta,400,440.0,2024-03-04
· Write a Python script that simulates the validation of an ETL pipeline’s output by
comparing etl_source.csv and etl_target.csv.
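One possible shape for such a script, assuming plain Python with the csv module (pandas or PySpark would work equally well), is sketched below. It re-applies the three transformations to etl_source.csv and compares the result row by row against etl_target.csv; the exact expected solution may differ.

import csv

# Re-derive the expected target from the source and compare it with etl_target.csv.
with open("etl_source.csv", newline="") as f:
    source = list(csv.DictReader(f))

expected = [
    {
        "id": r["id"],
        "name": r["name"],
        "value": r["value"],
        "updated_value": f"{float(r['value']) * 1.1:.1f}",  # 10% increment
        "processed_date": r["created_date"],                 # renamed column
    }
    for r in source
    if r["status"] == "Active"                               # drop inactive rows
]

with open("etl_target.csv", newline="") as f:
    target = list(csv.DictReader(f))

assert expected == target, f"ETL output mismatch:\nexpected={expected}\nactual={target}"
print("ETL validation passed: target matches the transformed source.")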
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFilter").getOrCreate()
df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 35), (3, "Charlie", 40)],
                           ["id", "name", "age"])
How would you write a PySpark script to check whether two DataFrames, df1 and
df2, are identical? The candidate may optionally write the comparison in Spark SQL
first and then translate it back into PySpark code.
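One common approach, shown here as a sketch rather than the only acceptable answer, compares the schemas and then uses exceptAll in both directions so that duplicate rows are accounted for; df1 and df2 are assumed to be already-defined DataFrames.

def dataframes_identical(df1, df2):
    # Schemas must match (column names and types).
    if df1.schema != df2.schema:
        return False
    # exceptAll keeps duplicate rows, so both differences being empty
    # means the two DataFrames contain exactly the same rows.
    return df1.exceptAll(df2).count() == 0 and df2.exceptAll(df1).count() == 0

print(dataframes_identical(df, df))  # True: a DataFrame always equals itself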
Expected Code:
# Sample DataFrames
customers_data = [
(1, "Alice", "USA"),
(2, "Bob", "UK"),
(3, "Charlie", "India"),
(4, "David", "USA")
]
transactions_data = [
(101, 1, 500, "2024-03-01"),
(102, 2, 300, "2024-03-02"),
(103, 1, 700, "2024-03-03"),
(104, 3, 400, "2024-03-03"),
(105, 2, 600, "2024-03-04"),
(106, 1, 200, "2024-03-05")
]
# Create DataFrames
customers_df = spark.createDataFrame(customers_data, ["customer_id", "name", "country"])
transactions_df = spark.createDataFrame(transactions_data, ["txn_id", "customer_id", "amount", "txn_date"])
# Show Results
ranked_df.show()
Code Description: The code joins two datasets (customers and transactions),
calculates the total amount spent by each customer, ranks them based on
spending, and displays the results.
Given the above PySpark code, write the following SQL queries and
corresponding test cases for validation:
(A) SQL Queries
1. Write an SQL query to perform the same JOIN operation between the customers and transactions tables.
2. Write an SQL query to calculate the total amount spent by each customer.
3. Write an SQL query to rank customers based on total spending, using window functions.
* See the section ‘Solution Hints (Private to Interviewer)’ for the expected outputs.
def test_customer_id_sorted():
    records = [
        {"customer_id": 102, "name": "Alice"},
        {"customer_id": 201, "name": "Bob"},
        {"customer_id": 305, "name": "Charlie"}
    ]
    sorted_records = sorted(records, key=lambda x: x["customer_id"])
    assert records == sorted_records, "Records are not sorted by customer_id"
-- 1. Perform Join
SELECT c.customer_id, c.name, c.country, t.txn_id, t.amount, t.txn_date
FROM transactions t
JOIN customers c ON t.customer_id = c.customer_id;
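Queries 2 and 3 from part (A) are not reproduced in this copy of the document. A plausible Spark SQL sketch, assuming the tables are registered as customers and transactions, is shown below; the exact expected solution may differ.

-- 2. Total amount spent by each customer
SELECT c.customer_id, c.name, SUM(t.amount) AS total_amount
FROM transactions t
JOIN customers c ON t.customer_id = c.customer_id
GROUP BY c.customer_id, c.name;

-- 3. Rank customers by total spending using a window function
SELECT customer_id, name, total_amount,
       RANK() OVER (ORDER BY total_amount DESC) AS spending_rank
FROM (
    SELECT c.customer_id, c.name, SUM(t.amount) AS total_amount
    FROM transactions t
    JOIN customers c ON t.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
) totals;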
The transformed dataset should not contain any NULL or negative values in the
amount_usd column.
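A straightforward way to assert this rule, sketched here with PySpark and assuming the transformed data is available as a DataFrame named transformed_df with an amount_usd column (both names are assumptions), is:

from pyspark.sql import functions as F

# Count rows that violate the rule: amount_usd is NULL or negative.
bad_rows = transformed_df.filter(
    F.col("amount_usd").isNull() | (F.col("amount_usd") < 0)
).count()

assert bad_rows == 0, f"{bad_rows} rows have NULL or negative amount_usd"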