NTT DATA
SQL questions in a Data Engineering interview
CTC → 28-35 LPA
(Difficulty level = Intermediate)
1. Write a SQL query to get the daily count of active users
(logged in at least once).
Objective:
Find how many distinct users logged in on each date (i.e., were active).
Sample Table: user_logins
user_id login_datetime
101 2024-06-01 08:30:00
102 2024-06-01 09:15:00
101 2024-06-01 18:20:00
103 2024-06-02 10:45:00
102 2024-06-02 14:00:00
Query:
SELECT
DATE(login_datetime) AS login_date,
COUNT(DISTINCT user_id) AS active_user_count
FROM user_logins
GROUP BY DATE(login_datetime)
ORDER BY login_date;
Output:
login_date active_user_count
2024-06-01 2
2024-06-02 2
2. Find the 2nd highest transaction per user without using
LIMIT or TOP
Objective:
For each user, return their 2nd highest transaction amount (not using LIMIT or TOP).
Sample Table: transactions
user_id transaction_amount
1 500
1 800
1 1000
2 200
2 300
Query:
SELECT user_id, transaction_amount
FROM (
SELECT *,
DENSE_RANK() OVER (PARTITION BY user_id ORDER BY transaction_amount DESC)
AS txn_rank
FROM transactions
) ranked_txns
WHERE txn_rank = 2;
Output:
user_id transaction_amount
1 800
2 200
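If window functions are also off the table, a correlated subquery can do the same job. A hedged sketch against the same transactions table, counting how many distinct amounts per user are strictly larger than the current row:
SELECT t1.user_id, t1.transaction_amount
FROM transactions t1
WHERE 1 = (
    -- exactly one distinct amount above this row = 2nd highest for the user
    SELECT COUNT(DISTINCT t2.transaction_amount)
    FROM transactions t2
    WHERE t2.user_id = t1.user_id
      AND t2.transaction_amount > t1.transaction_amount
);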
3. Identify data gaps in time-series event logs (e.g., missing
hourly records)
Objective:
Find hours where no events occurred, assuming continuous hourly logs are expected.
Sample Table: event_logs
event_time
2024-06-01 00:00:00
2024-06-01 01:00:00
2024-06-01 03:00:00
2024-06-01 04:00:00
Step-by-step:
1. Generate expected hourly time slots.
2. Left join event_logs on those.
3. Filter NULL (missing hours).
Query (PostgreSQL syntax shown; other engines may need an equivalent series generator, see the Spark SQL sketch after the output):
-- Step 1: Define time range
WITH time_series AS (
SELECT generate_series(
TIMESTAMP '2024-06-01 00:00:00',
TIMESTAMP '2024-06-01 05:00:00',
INTERVAL '1 hour'
) AS expected_time
),
-- Step 2: Round actual event times to the hour
rounded_events AS (
SELECT date_trunc('hour', event_time) AS event_hour
FROM event_logs
)
-- Step 3: Find missing hours
SELECT ts.expected_time AS missing_hour
FROM time_series ts
LEFT JOIN rounded_events re
ON ts.expected_time = re.event_hour
WHERE re.event_hour IS NULL
ORDER BY missing_hour;
Output:
missing_hour
2024-06-01 02:00:00
2024-06-01 05:00:00
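On Spark SQL / Databricks, sequence() plus explode() provides an equivalent hourly series. A hedged sketch assuming the same event_logs table and time range:
WITH time_series AS (
    -- expand an hourly array of timestamps into rows
    SELECT explode(
        sequence(
            TIMESTAMP '2024-06-01 00:00:00',
            TIMESTAMP '2024-06-01 05:00:00',
            INTERVAL 1 HOUR
        )
    ) AS expected_time
),
rounded_events AS (
    SELECT date_trunc('hour', event_time) AS event_hour
    FROM event_logs
)
SELECT ts.expected_time AS missing_hour
FROM time_series ts
LEFT JOIN rounded_events re
    ON ts.expected_time = re.event_hour
WHERE re.event_hour IS NULL
ORDER BY missing_hour;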
Summary Table
Question No Topic Key Concepts
1 Active Users Daily GROUP BY, COUNT(DISTINCT)
2 2nd Highest per Group DENSE_RANK(), OVER(PARTITION BY)
3 Time Series Gaps Detection generate_series, LEFT JOIN
4. Fetch the first purchase date per user and calculate days
since then.
Objective:
• Find each user's first purchase date.
• Calculate days passed since that purchase.
Sample Table: purchases
user_id purchase_date
101 2024-06-01
101 2024-06-10
102 2024-06-05
102 2024-06-25
Query:
SELECT
user_id,
MIN(purchase_date) AS first_purchase_date,
DATEDIFF(CURRENT_DATE, MIN(purchase_date)) AS days_since_first_purchase
FROM purchases
GROUP BY user_id;
Output:
user_id first_purchase_date days_since_first_purchase
101 2024-06-01 47
102 2024-06-05 43
(Assuming today's date is 2024-07-18)
Alternate using CTE & Window Functions (if more detail is needed):
WITH ranked_purchases AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY purchase_date ASC) AS rn
FROM purchases
)
SELECT
user_id,
purchase_date AS first_purchase_date,
DATEDIFF(CURRENT_DATE, purchase_date) AS days_since_first_purchase
FROM ranked_purchases
WHERE rn = 1;
5. Detect schema changes in SCD Type 2 tables using Delta
Lake
Objective:
In SCD Type 2, every update creates a new version (row). You need to detect schema
drift—i.e., when columns are added, removed, or changed in Delta Lake over time.
Delta Lake Table Example: /mnt/datalake/scd_customer_data
Method 1: Using Delta’s DESCRIBE HISTORY
DESCRIBE HISTORY delta.`/mnt/datalake/scd_customer_data`;
➤ Output:
version timestamp operation operationMetrics userMetadata
3 2024-07-15 12:00:00 WRITE {"numOutputRows": 20} added column "email"
Method 2: Compare Schema Across Versions
-- Load schema at version 1
DESCRIBE TABLE delta.`/mnt/datalake/scd_customer_data` VERSION AS OF 1;
-- Load schema at latest version
DESCRIBE TABLE delta.`/mnt/datalake/scd_customer_data`;
Manually compare or automate by reading schema as DataFrames:
# PySpark example
df_v1 = spark.read.format("delta").option("versionAsOf", 1).load("/mnt/datalake/scd_customer_data")
df_latest = spark.read.format("delta").load("/mnt/datalake/scd_customer_data")
# Compare schemas: the symmetric difference of column names = columns added or dropped
set(df_v1.schema.names) ^ set(df_latest.schema.names)
This shows added/dropped columns → schema drift detection. Note that comparing names alone
won't catch data-type changes; compare the full df.schema fields if those matter too.
6. Join product and transaction tables and filter out null
foreign keys safely
Objective:
Avoid joining with NULL product_id in transaction data to prevent incorrect or missing join
results.
Sample Tables:
transactions
txn_id user_id product_id
1 101 501
2 102 NULL
3 103 502
products
product_id product_name
501 Laptop
502 Monitor
503 Keyboard
Bad Join (unsafe – includes NULLs):
SELECT *
FROM transactions t
LEFT JOIN products p
ON t.product_id = p.product_id;
→ This keeps rows with a NULL product_id (e.g., txn_id 2), which come back with NULL product columns and can mislead downstream reports or aggregates.
Safe Join (filtering nulls first):
SELECT *
FROM transactions t
JOIN products p
ON t.product_id = p.product_id
WHERE t.product_id IS NOT NULL;
Or, with a LEFT JOIN to keep every transaction that has a product_id while still surfacing product_ids with no match in products (product_name comes back NULL):
SELECT t.*, p.product_name
FROM transactions t
LEFT JOIN products p
ON t.product_id = p.product_id
WHERE t.product_id IS NOT NULL;
Output:
txn_id user_id product_id product_name
1 101 501 Laptop
3 103 502 Monitor
Summary Table
Q No Topic Focus Area
4 First Purchase Date & Days Elapsed MIN(), DATEDIFF, GROUP BY
5 Schema Drift in Delta Lake (SCD2) DESCRIBE HISTORY, versioning
6 Safe Join on Foreign Keys JOIN, NULL handling
7. Get users who upgraded to premium within 7 days of
signup
Objective:
Find users who upgraded to premium within 7 days of their signup date.
Sample Table: users
user_id signup_date premium_upgrade_date
101 2024-01-01 2024-01-05
102 2024-02-01 2024-02-15
103 2024-03-01 NULL
104 2024-04-01 2024-04-06
Query:
SELECT *
FROM users
WHERE premium_upgrade_date IS NOT NULL
AND DATEDIFF(premium_upgrade_date, signup_date) <= 7;
Output:
user_id signup_date premium_upgrade_date
101 2024-01-01 2024-01-05
104 2024-04-01 2024-04-06
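DATEDIFF argument order differs between dialects (e.g., MySQL's DATEDIFF(later, earlier) vs SQL Server's DATEDIFF(unit, start, end)), so plain date arithmetic is sometimes safer. A hedged, roughly ANSI sketch:
SELECT *
FROM users
WHERE premium_upgrade_date IS NOT NULL
  -- upgraded on or before the 7th day after signup
  AND premium_upgrade_date <= signup_date + INTERVAL '7' DAY;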
8. Calculate cumulative distinct product purchases per
customer
Objective:
For each customer and each purchase date, calculate the cumulative count of distinct
products purchased so far.
Sample Table: purchases
customer_id product_id purchase_date
1 A 2024-01-01
1 B 2024-01-03
1 A 2024-01-05
1 C 2024-01-06
2 B 2024-01-02
2 B 2024-01-04
Solution using Window Functions + Subquery:
WITH product_log AS (
SELECT DISTINCT customer_id, product_id, purchase_date
FROM purchases
),
running_distincts AS (
SELECT
p1.customer_id,
p1.purchase_date,
COUNT(DISTINCT p2.product_id) AS cumulative_distinct_products
FROM product_log p1
JOIN product_log p2
ON p1.customer_id = p2.customer_id
AND p2.purchase_date <= p1.purchase_date
GROUP BY p1.customer_id, p1.purchase_date
)
SELECT *
FROM running_distincts
ORDER BY customer_id, purchase_date;
Output:
customer_id purchase_date cumulative_distinct_products
1 2024-01-01 1
1 2024-01-03 2
1 2024-01-05 2
1 2024-01-06 3
2 2024-01-02 1
2 2024-01-04 1
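The self-join above grows quadratically with purchases per customer. A window-only sketch over the same purchases table: flag the first time each customer buys each product, then take a running sum of those flags per customer.
WITH firsts AS (
    SELECT customer_id, purchase_date,
           -- 1 only the first time this customer buys this product
           CASE WHEN ROW_NUMBER() OVER (
                    PARTITION BY customer_id, product_id
                    ORDER BY purchase_date
                ) = 1 THEN 1 ELSE 0 END AS is_new_product
    FROM purchases
)
SELECT DISTINCT customer_id, purchase_date,
       SUM(is_new_product) OVER (
           PARTITION BY customer_id
           ORDER BY purchase_date
       ) AS cumulative_distinct_products
FROM firsts
ORDER BY customer_id, purchase_date;
The default RANGE frame includes ties on purchase_date, so DISTINCT collapses the result back to one row per customer and date, matching the output above.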
9. Retrieve customers who spent above average in their
region
Objective:
Find customers who spent more than their regional average.
Sample Table: customer_spending
customer_id region total_spent
1 North 1000
2 North 1200
3 South 900
4 South 600
5 West 700
Query using AVG() OVER():
SELECT *,
AVG(total_spent) OVER (PARTITION BY region) AS regional_avg
FROM customer_spending
QUALIFY total_spent > AVG(total_spent) OVER (PARTITION BY region);
If QUALIFY is not supported (like in MySQL), use a subquery:
WITH region_avg AS (
SELECT *,
AVG(total_spent) OVER (PARTITION BY region) AS regional_avg
FROM customer_spending
)
SELECT *
FROM region_avg
WHERE total_spent > regional_avg;
Output:
customer_id region total_spent regional_avg
2 North 1200 1100
3 South 900 750
10. Find duplicate rows in an ingestion table (based on all
columns)
Objective:
Detect rows that are exact duplicates across all columns in a table (commonly found in
ingestion pipelines).
Sample Table: ingestion_data
id name email signup_date
Query:
SELECT name, email, signup_date, COUNT(*) AS duplicate_count
FROM ingestion_data
GROUP BY name, email, signup_date
HAVING COUNT(*) > 1;
Output:
name email signup_date duplicate_count
You can extend this by joining back to get all duplicate row ids if needed.
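A hedged sketch of that extension, pulling every duplicated row together with its id:
SELECT i.*
FROM ingestion_data i
JOIN (
    SELECT name, email, signup_date
    FROM ingestion_data
    GROUP BY name, email, signup_date
    HAVING COUNT(*) > 1
) dups
  ON  i.name = dups.name
  AND i.email = dups.email
  AND i.signup_date = dups.signup_date;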
11. Compute daily revenue growth % using LAG window
function
Objective:
Compare each day’s revenue with the previous day and calculate percentage growth.
Sample Table: daily_revenue
revenue_date total_revenue
2024-07-01 1000
2024-07-02 1200
2024-07-03 900
2024-07-04 1500
Query:
SELECT
revenue_date,
total_revenue,
LAG(total_revenue) OVER (ORDER BY revenue_date) AS previous_day_revenue,
ROUND(
100.0 * (total_revenue - LAG(total_revenue) OVER (ORDER BY revenue_date))
/ NULLIF(LAG(total_revenue) OVER (ORDER BY revenue_date), 0),
2
) AS revenue_growth_pct
FROM daily_revenue;
Output:
revenue_date total_revenue previous_day_revenue revenue_growth_pct
2024-07-01 1000 NULL NULL
2024-07-02 1200 1000 20.00
2024-07-03 900 1200 -25.00
2024-07-04 1500 900 66.67
12. Identify products with declining sales 3 months in a row
Objective:
Detect products whose sales have been decreasing month-over-month for 3
consecutive months.
Sample Table: monthly_sales
product_id month monthly_sales
A 2024-01-01 500
A 2024-02-01 450
A 2024-03-01 400
A 2024-04-01 420
B 2024-01-01 300
B 2024-02-01 320
B 2024-03-01 290
Query:
WITH sales_with_lag AS (
SELECT
product_id,
month,
monthly_sales,
LAG(monthly_sales, 1) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_1,
LAG(monthly_sales, 2) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_2
FROM monthly_sales
)
SELECT *
FROM sales_with_lag
WHERE monthly_sales < prev_month_1
AND prev_month_1 < prev_month_2;
Output:
product_id month monthly_sales prev_month_1 prev_month_2
A 2024-03-01 400 450 500
Explanation: Product A shows a consistent drop from Jan → Feb → Mar.
13. Get users with at least 3 logins per week over last 2
months
Objective:
From the login data, identify users who logged in 3 or more times in at least one week during
the last 2 months.
Sample Table: user_logins
user_id login_datetime
1 2024-06-03 10:00:00
1 2024-06-04 12:30:00
1 2024-06-07 08:45:00
1 2024-06-10 09:00:00
2 2024-06-02 11:00:00
2 2024-06-20 11:30:00
2 2024-06-22 13:00:00
2 2024-06-23 15:30:00
Query:
WITH recent_logins AS (
SELECT *
FROM user_logins
WHERE login_datetime >= DATEADD(MONTH, -2, CURRENT_DATE)
),
weekly_login_counts AS (
SELECT
user_id,
DATE_TRUNC('week', login_datetime) AS login_week,
COUNT(*) AS weekly_logins
FROM recent_logins
GROUP BY user_id, DATE_TRUNC('week', login_datetime)
)
SELECT DISTINCT user_id
FROM weekly_login_counts
WHERE weekly_logins >= 3;
Output (assuming the two-month window covers June 2024):
user_id
1
2
14. Rank users by frequency of login in the current quarter
Objective:
Rank users by number of logins during the current calendar quarter (Q1/Q2/Q3/Q4).
Query:
WITH current_quarter_logins AS (
SELECT *
FROM user_logins
WHERE QUARTER(login_datetime) = QUARTER(CURRENT_DATE)
AND YEAR(login_datetime) = YEAR(CURRENT_DATE)
),
login_counts AS (
SELECT user_id, COUNT(*) AS total_logins
FROM current_quarter_logins
GROUP BY user_id
)
SELECT *,
RANK() OVER (ORDER BY total_logins DESC) AS login_rank
FROM login_counts;
Output:
user_id total_logins login_rank
1 20 1
3 17 2
2 9 3
15. Fetch users who purchased same product multiple times
in one day
Objective:
Identify users who purchased the same product more than once on the same day —
useful for detecting repeated purchases or anomalies.
Sample Table: purchases
user_id product_id purchase_datetime
101 A 2024-06-01 10:00:00
101 A 2024-06-01 14:00:00
102 B 2024-06-02 11:00:00
103 A 2024-06-03 10:00:00
103 A 2024-06-03 10:10:00
Query:
SELECT user_id, product_id, DATE(purchase_datetime) AS purchase_date,
COUNT(*) AS times_purchased
FROM purchases
GROUP BY user_id, product_id, DATE(purchase_datetime)
HAVING COUNT(*) > 1;
Output:
user_id product_id purchase_date times_purchased
101 A 2024-06-01 2
103 A 2024-06-03 2
16. Detect and delete late-arriving data for current month
partitions
Objective:
Late-arriving data refers to records arriving after the partition date they belong to. For
example, a record with event date = July 5 but arriving in August.
Assumption:
We have a table events_partitioned partitioned by event_date. The ingestion_date (or
load_timestamp) tells when the data actually landed.
Sample Table: events_partitioned
event_id event_date ingestion_date
1 2024-07-05 2024-07-05
2 2024-07-10 2024-08-01
3 2024-07-12 2024-07-12
Step 1: Detect Late-Arriving Records
SELECT *
FROM events_partitioned
WHERE MONTH(event_date) = MONTH(CURRENT_DATE)
AND YEAR(event_date) = YEAR(CURRENT_DATE)
AND ingestion_date > LAST_DAY(event_date);
This gives current month’s partitions where data came in after month-end.
Step 2: Delete Late-Arriving Data (for example in Delta Lake)
If you’re using Delta Lake, you can run:
DELETE FROM events_partitioned
WHERE MONTH(event_date) = MONTH(CURRENT_DATE)
AND YEAR(event_date) = YEAR(CURRENT_DATE)
AND ingestion_date > LAST_DAY(event_date);
Or for Hive/Databricks Spark SQL:
DELETE FROM events_partitioned
WHERE ingestion_date > LAST_DAY(event_date)
AND date_format(event_date, 'yyyy-MM') = date_format(current_date(), 'yyyy-MM');
In practice, archive these rows before deleting to ensure auditability.
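A minimal archive-then-delete sketch, assuming a pre-created late_events_archive table with the same schema (the table name is illustrative):
INSERT INTO late_events_archive
SELECT *
FROM events_partitioned
WHERE date_format(event_date, 'yyyy-MM') = date_format(current_date(), 'yyyy-MM')
  AND ingestion_date > LAST_DAY(event_date);
-- then remove the same rows from the main table
DELETE FROM events_partitioned
WHERE date_format(event_date, 'yyyy-MM') = date_format(current_date(), 'yyyy-MM')
  AND ingestion_date > LAST_DAY(event_date);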
17. Get top 5 products by profit margin across all categories
Objective:
Calculate profit margin per product and get top 5 products with highest margin,
regardless of category.
Sample Table: products
product_id product_name category selling_price cost_price
1 A Laptop 1000 700
2 B Phone 800 600
3 C Monitor 500 200
4 D Mouse 150 100
5 E Laptop 1200 800
6 F Tablet 600 580
Query:
SELECT product_id, product_name, category,
selling_price, cost_price,
ROUND((selling_price - cost_price) / cost_price * 100, 2) AS profit_margin_pct
FROM products
ORDER BY profit_margin_pct DESC
LIMIT 5;
Output:
product_id product_name category profit_margin_pct
3 C Monitor 150.00
4 D Mouse 50.00
5 E Laptop 50.00
1 A Laptop 42.86
2 B Phone 33.33
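LIMIT 5 cuts ties arbitrarily (D and E are both at 50%); if every product sharing the 5th-place margin should be kept, a RANK-based sketch works:
SELECT *
FROM (
    SELECT product_id, product_name, category,
           ROUND((selling_price - cost_price) / cost_price * 100, 2) AS profit_margin_pct,
           RANK() OVER (
               ORDER BY (selling_price - cost_price) / cost_price DESC
           ) AS margin_rank
    FROM products
) ranked
WHERE margin_rank <= 5;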
18. Compare rolling 30-day revenue vs previous 30-day
window
Objective:
For each day, compare revenue from current 30-day window with the previous 30-day
window, ideally in a time-series graph or table.
Sample Table: daily_revenue
revenue_date total_revenue
2024-06-01 1000
2024-06-02 1100
... ...
2024-07-15 1500
Query:
WITH revenue_base AS (
SELECT revenue_date, total_revenue
FROM daily_revenue
WHERE revenue_date >= DATEADD(DAY, -90, CURRENT_DATE)
),
rolling_windows AS (
SELECT
revenue_date,
-- Current 30-day rolling sum
SUM(total_revenue) OVER (
ORDER BY revenue_date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) AS revenue_last_30,
-- Previous 30-day rolling sum (shifted)
SUM(total_revenue) OVER (
ORDER BY revenue_date
ROWS BETWEEN 59 PRECEDING AND 30 PRECEDING
) AS revenue_prev_30
FROM revenue_base
)
SELECT *,
ROUND(100.0 * (revenue_last_30 - revenue_prev_30) / NULLIF(revenue_prev_30, 0), 2)
AS growth_pct
FROM rolling_windows
WHERE revenue_prev_30 IS NOT NULL
ORDER BY revenue_date;
Output:
revenue_date revenue_last_30 revenue_prev_30 growth_pct
2024-07-01 28000 25000 12.00
2024-07-02 28250 25300 11.64
... ... ... ...
19. Flag transactions happening outside business hours
Problem Statement:
Identify and flag transactions that occurred outside business hours. Assume business
hours are from 09:00 AM to 06:00 PM.
Example Table: transactions
transaction_id user_id transaction_time
101 1 2024-06-01 08:45:00
102 2 2024-06-01 10:15:00
103 3 2024-06-01 19:30:00
104 4 2024-06-01 14:00:00
Expected Output:
transaction_id user_id transaction_time is_outside_business_hours
101 1 2024-06-01 08:45:00 Yes
102 2 2024-06-01 10:15:00 No
103 3 2024-06-01 19:30:00 Yes
104 4 2024-06-01 14:00:00 No
SQL Solution (Standard SQL):
SELECT
transaction_id,
user_id,
transaction_time,
CASE
WHEN CAST(transaction_time AS TIME) < '09:00:00'
OR CAST(transaction_time AS TIME) > '18:00:00'
THEN 'Yes'
ELSE 'No'
END AS is_outside_business_hours
FROM transactions;
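In Spark SQL / Databricks there is no TIME type to cast to; a hedged variant compares the formatted time-of-day string instead (same boundaries as above):
SELECT
transaction_id,
user_id,
transaction_time,
CASE
WHEN date_format(transaction_time, 'HH:mm:ss')
NOT BETWEEN '09:00:00' AND '18:00:00'
THEN 'Yes'
ELSE 'No'
END AS is_outside_business_hours
FROM transactions;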
20. Write an optimized SQL query using broadcast join hints for
small lookup tables
Problem Statement:
In platforms like Databricks or Spark SQL, use broadcast join hints to optimize queries
when joining with small lookup tables.
Example Tables:
transactions (large table):
txn_id user_id product_id amount
1 101 P1 100
2 102 P2 150
products (small lookup table):
product_id product_name category
P1 Apple Fruit
P2 Milk Dairy
Goal:
Join the tables efficiently using a broadcast hint on the small products table.
SQL Solution (Spark SQL / Databricks):
SELECT /*+ BROADCAST(p) */
t.txn_id,
t.user_id,
t.product_id,
t.amount,
p.product_name,
p.category
FROM transactions t
JOIN products p
ON t.product_id = p.product_id;
• /*+ BROADCAST(p) */ hints Spark to broadcast the products table to all nodes —
very efficient when products is small.
• Reduces shuffle during join, improving performance.
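• Related knob (a hedged aside): Spark also broadcasts automatically when a table's estimated size is below spark.sql.autoBroadcastJoinThreshold; the hint simply forces that plan regardless of statistics.
SET spark.sql.autoBroadcastJoinThreshold = 52428800; -- e.g. 50 MB (the default is 10 MB)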