
NTT DATA

SQL questions in a Data Engineering interview

CTC → 28-35 LPA
(Difficulty level = Intermediate)

1. Write a SQL query to get the daily count of active users (logged in at least once)

Objective:

Find how many distinct users logged in on each date (i.e., were active).

Sample Table: user_logins

user_id login_datetime

101 2024-06-01 08:30:00

102 2024-06-01 09:15:00

101 2024-06-01 18:20:00

103 2024-06-02 10:45:00

102 2024-06-02 14:00:00

Query:

SELECT
    DATE(login_datetime) AS login_date,
    COUNT(DISTINCT user_id) AS active_user_count
FROM user_logins
GROUP BY DATE(login_datetime)
ORDER BY login_date;

Output:

login_date active_user_count

2024-06-01 2

2024-06-02 2
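
Note: DATE() is the MySQL/Databricks-style function; as a hedged variant for engines without it (e.g., SQL Server), casting to DATE gives the same grouping:

SELECT
    CAST(login_datetime AS DATE) AS login_date,
    COUNT(DISTINCT user_id) AS active_user_count
FROM user_logins
GROUP BY CAST(login_datetime AS DATE)
ORDER BY login_date;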

2. Find the 2nd highest transaction per user without using LIMIT or TOP

Objective:

For each user, return their 2nd highest transaction amount (not using LIMIT or TOP).

Sample Table: transactions

user_id transaction_amount

1 500

1 800

1 1000

2 200

2 300

Query:

SELECT user_id, transaction_amount
FROM (
    SELECT *,
        DENSE_RANK() OVER (PARTITION BY user_id ORDER BY transaction_amount DESC) AS txn_rank
    FROM transactions
) ranked_txns
WHERE txn_rank = 2;

Output:

user_id transaction_amount

1 800

2 200
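
If window functions are also off the table, a hedged alternative is a correlated subquery that keeps rows with exactly one distinct higher amount for the same user:

SELECT user_id, transaction_amount
FROM transactions t1
WHERE 1 = (
    SELECT COUNT(DISTINCT t2.transaction_amount)
    FROM transactions t2
    WHERE t2.user_id = t1.user_id
      AND t2.transaction_amount > t1.transaction_amount
);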

3. Identify data gaps in time-series event logs (e.g., missing hourly records)

Objective:

Find hours where no events occurred, assuming continuous hourly logs are expected.

Sample Table: event_logs

event_time

2024-06-01 00:00:00

2024-06-01 01:00:00

2024-06-01 03:00:00

2024-06-01 04:00:00

Step-by-step:

1. Generate expected hourly time slots.

2. Left join event_logs on those.

3. Filter NULL (missing hours).

Query (PostgreSQL syntax; generate_series is not available in Snowflake or Databricks SQL, see the variant after the output):

-- Step 1: Define the expected hourly time range
WITH time_series AS (
    SELECT generate_series(
        TIMESTAMP '2024-06-01 00:00:00',
        TIMESTAMP '2024-06-01 05:00:00',
        INTERVAL '1 hour'
    ) AS expected_time
),
-- Step 2: Round actual event times to the hour
rounded_events AS (
    SELECT date_trunc('hour', event_time) AS event_hour
    FROM event_logs
)
-- Step 3: Find missing hours
SELECT ts.expected_time AS missing_hour
FROM time_series ts
LEFT JOIN rounded_events re
    ON ts.expected_time = re.event_hour
WHERE re.event_hour IS NULL
ORDER BY missing_hour;

Output:

missing_hour

2024-06-01 02:00:00

2024-06-01 05:00:00
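
generate_series is PostgreSQL-specific. A hedged Databricks / Spark SQL equivalent builds the expected hours with explode(sequence(...)) instead:

-- Spark SQL / Databricks sketch (same fixed range as above)
WITH time_series AS (
    SELECT explode(sequence(
        TIMESTAMP '2024-06-01 00:00:00',
        TIMESTAMP '2024-06-01 05:00:00',
        INTERVAL 1 HOUR
    )) AS expected_time
),
rounded_events AS (
    SELECT date_trunc('HOUR', event_time) AS event_hour
    FROM event_logs
)
SELECT ts.expected_time AS missing_hour
FROM time_series ts
LEFT JOIN rounded_events re
    ON ts.expected_time = re.event_hour
WHERE re.event_hour IS NULL
ORDER BY missing_hour;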

Summary Table

Question No Topic Key Concepts

1 Active Users Daily GROUP BY, COUNT(DISTINCT)

2 2nd Highest per Group DENSE_RANK(), OVER(PARTITION BY)

3 Time Series Gaps Detection generate_series, LEFT JOIN

4. Fetch the first purchase date per user and calculate days since then

Objective:

• Find each user's first purchase date.

• Calculate days passed since that purchase.

Sample Table: purchases

user_id purchase_date

101 2024-06-01

101 2024-06-10

102 2024-06-05

102 2024-06-25

Query:

SELECT
    user_id,
    MIN(purchase_date) AS first_purchase_date,
    DATEDIFF(CURRENT_DATE, MIN(purchase_date)) AS days_since_first_purchase
FROM purchases
GROUP BY user_id;

Output:

user_id first_purchase_date days_since_first_purchase

101 2024-06-01 47

102 2024-06-05 43

(Assuming today's date is 2024-07-18)

Alternate using CTE & Window Functions (if more detail is needed):

WITH ranked_purchases AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY purchase_date ASC) AS rn
    FROM purchases
)
SELECT
    user_id,
    purchase_date AS first_purchase_date,
    DATEDIFF(CURRENT_DATE, purchase_date) AS days_since_first_purchase
FROM ranked_purchases
WHERE rn = 1;

5. Detect schema changes in SCD Type 2 tables using Delta Lake

Objective:

In SCD Type 2, every update creates a new version (row). You need to detect schema drift, i.e., when columns are added, removed, or changed in Delta Lake over time.

Delta Lake Table Example: /mnt/datalake/scd_customer_data

Method 1: Using Delta’s DESCRIBE HISTORY

DESCRIBE HISTORY delta.`/mnt/datalake/scd_customer_data`;

➤ Output:

version timestamp operation operationMetrics userMetadata

3 2024-07-15 12:00:00 WRITE {"numOutputRows": 20} added column "email"

Method 2: Compare Schema Across Versions

-- Load schema at version 1

DESCRIBE TABLE delta.`/mnt/datalake/scd_customer_data` VERSION AS OF 1;


-- Load schema at latest version

DESCRIBE TABLE delta.`/mnt/datalake/scd_customer_data`;

Manually compare or automate by reading schema as DataFrames:

# PySpark example
df_v1 = spark.read.format("delta").option("versionAsOf", 1).load("/mnt/datalake/scd_customer_data")

df_latest = spark.read.format("delta").load("/mnt/datalake/scd_customer_data")

# Compare schemas (symmetric difference of column names)
set(df_v1.schema.names) ^ set(df_latest.schema.names)

This will show added/dropped columns → schema drift detection.

6. Join product and transaction tables and filter out null foreign keys safely

Objective:

Avoid joining with NULL product_id in transaction data to prevent incorrect or missing join results.

Sample Tables:

transactions

txn_id user_id product_id

1 101 501

2 102 NULL

3 103 502

products

product_id product_name

501 Laptop

502 Monitor

503 Keyboard

Bad Join (unsafe – includes NULLs):

SELECT *
FROM transactions t
LEFT JOIN products p
    ON t.product_id = p.product_id;

→ This keeps rows with a NULL product_id; since NULL never matches, they come back with NULL product columns, which can mislead downstream consumers.

Safe Join (filtering nulls first):

SELECT *
FROM transactions t
JOIN products p
    ON t.product_id = p.product_id
WHERE t.product_id IS NOT NULL;

Or, with a LEFT JOIN that keeps every transaction that has a product_id while still surfacing rows whose product_id has no match in products:

SELECT t.*, p.product_name
FROM transactions t
LEFT JOIN products p
    ON t.product_id = p.product_id
WHERE t.product_id IS NOT NULL;

Output:

txn_id user_id product_id product_name

1 101 501 Laptop

3 103 502 Monitor
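
A related check worth adding (a sketch, not part of the original question) is an anti-join that lists transactions whose non-NULL product_id has no matching row in products, i.e., broken foreign keys rather than missing ones:

SELECT t.*
FROM transactions t
LEFT JOIN products p
    ON t.product_id = p.product_id
WHERE t.product_id IS NOT NULL
  AND p.product_id IS NULL;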

Summary Table

Q No Topic Focus Area

4 First Purchase Date & Days Elapsed MIN(), DATEDIFF, GROUP BY

5 Schema Drift in Delta Lake (SCD2) DESCRIBE HISTORY, versioning

6 Safe Join on Foreign Keys JOIN, NULL handling

7. Get users who upgraded to premium within 7 days of signup

Objective:

Find users who upgraded to premium within 7 days of their signup date.

Sample Table: users

user_id signup_date premium_upgrade_date

101 2024-01-01 2024-01-05

102 2024-02-01 2024-02-15

103 2024-03-01 NULL

104 2024-04-01 2024-04-06

Query:

SELECT *
FROM users
WHERE premium_upgrade_date IS NOT NULL
  AND DATEDIFF(premium_upgrade_date, signup_date) <= 7;

Output:

user_id signup_date premium_upgrade_date

101 2024-01-01 2024-01-05

104 2024-04-01 2024-04-06

8. Calculate cumulative distinct product purchases per customer

Objective:

For each customer and each purchase date, calculate the cumulative count of distinct products purchased so far.

Sample Table: purchases

customer_id product_id purchase_date

1 A 2024-01-01

1 B 2024-01-03

1 A 2024-01-05

1 C 2024-01-06

2 B 2024-01-02

2 B 2024-01-04

Solution using a CTE and self-join:

WITH product_log AS (
    SELECT DISTINCT customer_id, product_id, purchase_date
    FROM purchases
),
running_distincts AS (
    SELECT
        p1.customer_id,
        p1.purchase_date,
        COUNT(DISTINCT p2.product_id) AS cumulative_distinct_products
    FROM product_log p1
    JOIN product_log p2
        ON p1.customer_id = p2.customer_id
        AND p2.purchase_date <= p1.purchase_date
    GROUP BY p1.customer_id, p1.purchase_date
)
SELECT *
FROM running_distincts
ORDER BY customer_id, purchase_date;

Output:

customer_id purchase_date cumulative_distinct_products

1 2024-01-01 1

1 2024-01-03 2

1 2024-01-05 2

1 2024-01-06 3

2 2024-01-02 1

2 2024-01-04 1
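
The self-join above can get expensive on large tables. A hedged window-function alternative (one output row per purchase rather than per date) flags the first time each product appears for a customer and takes a running sum of those flags:

WITH first_seen AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY customer_id, product_id ORDER BY purchase_date) AS rn
    FROM purchases
)
SELECT
    customer_id,
    purchase_date,
    SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) OVER (
        PARTITION BY customer_id ORDER BY purchase_date
    ) AS cumulative_distinct_products
FROM first_seen
ORDER BY customer_id, purchase_date;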

9. Retrieve customers who spent above average in their region

Objective:

Find customers who spent more than their regional average.

Sample Table: customer_spending

customer_id region total_spent

1 North 1000

2 North 1200

3 South 900

4 South 600

5 West 700

Query using AVG() OVER():

SELECT *,
    AVG(total_spent) OVER (PARTITION BY region) AS regional_avg
FROM customer_spending
QUALIFY total_spent > AVG(total_spent) OVER (PARTITION BY region);

If QUALIFY is not supported (like in MySQL), use a subquery:

WITH region_avg AS (
    SELECT *,
        AVG(total_spent) OVER (PARTITION BY region) AS regional_avg
    FROM customer_spending
)
SELECT *
FROM region_avg
WHERE total_spent > regional_avg;

Output:

customer_id region total_spent regional_avg

2 North 1200 1100

3 South 900 750

10. Find duplicate rows in an ingestion table (based on all columns)

Objective:

Detect rows that are exact duplicates across all columns in a table (commonly found in ingestion pipelines).

Sample Table: ingestion_data

id name email signup_date

1 John [email protected] 2024-01-01

2 Alice [email protected] 2024-01-02

3 John [email protected] 2024-01-01

4 John [email protected] 2024-01-01

Query:

SELECT name, email, signup_date, COUNT(*) AS duplicate_count
FROM ingestion_data
GROUP BY name, email, signup_date
HAVING COUNT(*) > 1;

Output:
name email signup_date duplicate_count

John [email protected] 2024-01-01 3

You can extend this by joining back to get all duplicate row ids if needed.
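
A minimal sketch of that join-back, using the id column from the sample table:

SELECT i.id, i.name, i.email, i.signup_date
FROM ingestion_data i
JOIN (
    SELECT name, email, signup_date
    FROM ingestion_data
    GROUP BY name, email, signup_date
    HAVING COUNT(*) > 1
) d
    ON i.name = d.name
    AND i.email = d.email
    AND i.signup_date = d.signup_date
ORDER BY i.name, i.id;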

11. Compute daily revenue growth % using the LAG window function

Objective:

Compare each day's revenue with the previous day and calculate the percentage growth.

Sample Table: daily_revenue

revenue_date total_revenue

2024-07-01 1000

2024-07-02 1200

2024-07-03 900

2024-07-04 1500

Query:

SELECT
    revenue_date,
    total_revenue,
    LAG(total_revenue) OVER (ORDER BY revenue_date) AS previous_day_revenue,
    ROUND(
        100.0 * (total_revenue - LAG(total_revenue) OVER (ORDER BY revenue_date))
        / NULLIF(LAG(total_revenue) OVER (ORDER BY revenue_date), 0),
        2
    ) AS revenue_growth_pct
FROM daily_revenue;

Output:

revenue_date total_revenue previous_day_revenue revenue_growth_pct

2024-07-01 1000 NULL NULL

2024-07-02 1200 1000 20.00

2024-07-03 900 1200 -25.00

2024-07-04 1500 900 66.67

12. Identify products with declining sales 3 months in a row

Objective:

Detect products whose sales have been decreasing month-over-month for 3 consecutive months.

Sample Table: monthly_sales

product_id month monthly_sales

A 2024-01-01 500

A 2024-02-01 450

A 2024-03-01 400

A 2024-04-01 420

B 2024-01-01 300

B 2024-02-01 320

B 2024-03-01 290

Query:

WITH sales_with_lag AS (
    SELECT
        product_id,
        month,
        monthly_sales,
        LAG(monthly_sales, 1) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_1,
        LAG(monthly_sales, 2) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_2
    FROM monthly_sales
)
SELECT *
FROM sales_with_lag
WHERE monthly_sales < prev_month_1
  AND prev_month_1 < prev_month_2;

Output:
product_id month monthly_sales prev_month_1 prev_month_2

A 2024-03-01 400 450 500

Explanation: Product A shows a consistent drop from Jan → Feb → Mar.

13. Get users with at least 3 logins per week over the last 2 months

Objective:

From the login data, identify users who logged in 3 or more times per week, for at least one week, in the last 2 months.

Sample Table: user_logins

user_id login_datetime

1 2024-06-03 10:00:00

1 2024-06-04 12:30:00

1 2024-06-07 08:45:00

1 2024-06-10 09:00:00

2 2024-06-02 11:00:00

2 2024-06-20 11:30:00

2 2024-06-22 13:00:00

2 2024-06-23 15:30:00

Query:

WITH recent_logins AS (
    SELECT *
    FROM user_logins
    WHERE login_datetime >= DATEADD(MONTH, -2, CURRENT_DATE)
),
weekly_login_counts AS (
    SELECT
        user_id,
        DATE_TRUNC('week', login_datetime) AS login_week,
        COUNT(*) AS weekly_logins
    FROM recent_logins
    GROUP BY user_id, DATE_TRUNC('week', login_datetime)
)
SELECT DISTINCT user_id
FROM weekly_login_counts
WHERE weekly_logins >= 3;

Output:

user_id

1

2

(Assuming the sample dates fall within the last 2 months: user 1 has 3 logins in the week starting 2024-06-03, and user 2 has 3 logins in the week starting 2024-06-17.)

14. Rank users by frequency of login in the current quarter

Objective:

Rank users by number of logins during the current calendar quarter (Q1/Q2/Q3/Q4).

Query:

WITH current_quarter_logins AS (
    SELECT *
    FROM user_logins
    WHERE QUARTER(login_datetime) = QUARTER(CURRENT_DATE)
      AND YEAR(login_datetime) = YEAR(CURRENT_DATE)
),
login_counts AS (
    SELECT user_id, COUNT(*) AS total_logins
    FROM current_quarter_logins
    GROUP BY user_id
)
SELECT *,
    RANK() OVER (ORDER BY total_logins DESC) AS login_rank
FROM login_counts;

Output:

user_id total_logins login_rank

1 20 1

3 17 2

2 9 3
15. Fetch users who purchased the same product multiple times in one day

Objective:

Identify users who purchased the same product more than once on the same day, which is useful for detecting repeated purchases or anomalies.

Sample Table: purchases

user_id product_id purchase_datetime

101 A 2024-06-01 10:00:00

101 A 2024-06-01 14:00:00

102 B 2024-06-02 11:00:00

103 A 2024-06-03 10:00:00

103 A 2024-06-03 10:10:00

Query:

SELECT user_id, product_id, DATE(purchase_datetime) AS purchase_date,
    COUNT(*) AS times_purchased
FROM purchases
GROUP BY user_id, product_id, DATE(purchase_datetime)
HAVING COUNT(*) > 1;

Output:
user_id product_id purchase_date times_purchased

101 A 2024-06-01 2

103 A 2024-06-03 2

16. Detect and delete late-arriving data for current-month partitions

Objective:

Late-arriving data refers to records arriving after the partition date they belong to. For example, a record with event date = July 5 but arriving in August.

Assumption:

We have a table events_partitioned partitioned by event_date. The ingestion_date (or load_timestamp) tells when the data actually landed.

Sample Table: events_partitioned

event_id event_date ingestion_date

1 2024-07-05 2024-07-05

2 2024-07-10 2024-08-01

3 2024-07-12 2024-07-12

Step 1: Detect Late-Arriving Records

SELECT *
FROM events_partitioned
WHERE MONTH(event_date) = MONTH(CURRENT_DATE)
  AND YEAR(event_date) = YEAR(CURRENT_DATE)
  AND ingestion_date > LAST_DAY(event_date);

This gives the current month's partitions where data came in after month-end (the year check keeps prior years' same-month partitions out).

Step 2: Delete Late-Arriving Data (for example in Delta Lake)

If you’re using Delta Lake, you can run:

DELETE FROM events_partitioned
WHERE MONTH(event_date) = MONTH(CURRENT_DATE)
  AND YEAR(event_date) = YEAR(CURRENT_DATE)
  AND ingestion_date > LAST_DAY(event_date);

Or for Hive/Databricks Spark SQL:

DELETE FROM events_partitioned
WHERE ingestion_date > LAST_DAY(event_date)
  AND date_format(event_date, 'yyyy-MM') = date_format(current_date(), 'yyyy-MM');

In practice, archive these rows before deleting to ensure auditability.
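
A minimal sketch of that archive-then-delete flow (the archive table name is assumed for illustration):

-- Assumes events_partitioned_late_archive has the same columns plus archived_at
INSERT INTO events_partitioned_late_archive
SELECT *, current_timestamp() AS archived_at
FROM events_partitioned
WHERE ingestion_date > LAST_DAY(event_date)
  AND date_format(event_date, 'yyyy-MM') = date_format(current_date(), 'yyyy-MM');

DELETE FROM events_partitioned
WHERE ingestion_date > LAST_DAY(event_date)
  AND date_format(event_date, 'yyyy-MM') = date_format(current_date(), 'yyyy-MM');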

17. Get top 5 products by profit margin across all categories

Objective:

Calculate profit margin per product and get the top 5 products with the highest margin, regardless of category.

Sample Table: products

product_id product_name category selling_price cost_price

1 A Laptop 1000 700

2 B Phone 800 600

3 C Monitor 500 200

4 D Mouse 150 100

5 E Laptop 1200 800

6 F Tablet 600 580

Query:

SELECT product_id, product_name, category,
    selling_price, cost_price,
    ROUND((selling_price - cost_price) / cost_price * 100, 2) AS profit_margin_pct
FROM products
ORDER BY profit_margin_pct DESC
LIMIT 5;

Output:

product_id product_name category profit_margin_pct

3 C Monitor 150.00

4 D Mouse 50.00

5 E Laptop 50.00

1 A Laptop 42.86

2 B Phone 33.33
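
Note that the formula above is markup over cost. If margin over the selling price is wanted instead, a hedged variant is:

SELECT product_id, product_name, category,
    ROUND((selling_price - cost_price) / selling_price * 100, 2) AS margin_on_price_pct
FROM products
ORDER BY margin_on_price_pct DESC
LIMIT 5;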

18. Compare rolling 30-day revenue vs previous 30-day window

Objective:

For each day, compare revenue from the current 30-day window with the previous 30-day window, ideally in a time-series graph or table.

Sample Table: daily_revenue

revenue_date total_revenue

2024-06-01 1000

2024-06-02 1100

... ...

2024-07-15 1500

Query:

WITH revenue_base AS (
    SELECT revenue_date, total_revenue
    FROM daily_revenue
    WHERE revenue_date >= DATEADD(DAY, -90, CURRENT_DATE)
),
rolling_windows AS (
    SELECT
        revenue_date,
        -- Current 30-day rolling sum
        SUM(total_revenue) OVER (
            ORDER BY revenue_date
            ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
        ) AS revenue_last_30,
        -- Previous 30-day rolling sum (shifted)
        SUM(total_revenue) OVER (
            ORDER BY revenue_date
            ROWS BETWEEN 59 PRECEDING AND 30 PRECEDING
        ) AS revenue_prev_30
    FROM revenue_base
)
SELECT *,
    ROUND(100.0 * (revenue_last_30 - revenue_prev_30) / NULLIF(revenue_prev_30, 0), 2) AS growth_pct
FROM rolling_windows
WHERE revenue_prev_30 IS NOT NULL
ORDER BY revenue_date;

Output:

revenue_date revenue_last_30 revenue_prev_30 growth_pct

2024-07-01 28000 25000 12.00

2024-07-02 28250 25300 11.64

... ... ... ...

19. Flag transactions happening outside business hours

Problem Statement:

Identify and flag transactions that occurred outside business hours. Assume business hours are from 09:00 AM to 06:00 PM.

Example Table: transactions

transaction_id user_id transaction_time

101 1 2024-06-01 08:45:00

102 2 2024-06-01 10:15:00

103 3 2024-06-01 19:30:00

104 4 2024-06-01 14:00:00

Expected Output:

transaction_id user_id transaction_time is_outside_business_hours

101 1 2024-06-01 08:45:00 Yes

102 2 2024-06-01 10:15:00 No

103 3 2024-06-01 19:30:00 Yes

104 4 2024-06-01 14:00:00 No

SQL Solution (Standard SQL):

SELECT
    transaction_id,
    user_id,
    transaction_time,
    CASE
        WHEN CAST(transaction_time AS TIME) < '09:00:00'
          OR CAST(transaction_time AS TIME) > '18:00:00'
        THEN 'Yes'
        ELSE 'No'
    END AS is_outside_business_hours
FROM transactions;
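
Some engines (Spark SQL, for example) have no TIME type, so CAST(... AS TIME) will not work there. A hedged hour-based variant is sketched below; note it treats 18:00 and later as outside hours, a slight boundary difference from the CAST version:

SELECT
    transaction_id,
    user_id,
    transaction_time,
    CASE
        WHEN HOUR(transaction_time) < 9 OR HOUR(transaction_time) >= 18
        THEN 'Yes'
        ELSE 'No'
    END AS is_outside_business_hours
FROM transactions;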

20. Write an optimized SQL query using broadcast join hints for small lookup tables

Problem Statement:

In platforms like Databricks or Spark SQL, use broadcast join hints to optimize queries when joining with small lookup tables.

Example Tables:

transactions (large table):

txn_id user_id product_id amount

1 101 P1 100

2 102 P2 150

products (small lookup table):

product_id product_name category

P1 Apple Fruit

P2 Milk Dairy

Goal:

Join the tables efficiently using a broadcast hint on the small products table.
SQL Solution (Spark SQL / Databricks):

SELECT /*+ BROADCAST(p) */
    t.txn_id,
    t.user_id,
    t.product_id,
    t.amount,
    p.product_name,
    p.category
FROM transactions t
JOIN products p
    ON t.product_id = p.product_id;

• /*+ BROADCAST(p) */ hints Spark to broadcast the products table to all nodes, which is very efficient when products is small.

• Reduces shuffle during the join, improving performance.
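
• To confirm the hint took effect, one way (a sketch, assuming Spark 3.x) is to inspect the physical plan with EXPLAIN and look for a broadcast join:

EXPLAIN FORMATTED
SELECT /*+ BROADCAST(p) */
    t.txn_id, p.product_name
FROM transactions t
JOIN products p
    ON t.product_id = p.product_id;
-- The plan should show BroadcastHashJoin / BroadcastExchange rather than SortMergeJoin.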
