SQL Optimization

The document discusses the differences between UNION and UNION ALL in SQL, emphasizing that UNION removes duplicates and is slower, while UNION ALL retains duplicates and is faster. It advises using UNION ALL unless deduplication is necessary, and provides optimization tips to avoid expensive sorting operations. Real-world examples illustrate the performance benefits of using UNION ALL over UNION in large datasets.
SQL OPTIMIZATION INTERVIEW QUESTION

Interview Question:

Q: What's the difference between UNION and UNION ALL, and when should you use each?
Asked by: TCS Digital, Cognizant GenC Next

✅ Answer:
 UNION: Removes duplicate rows → slower, because the engine must sort or hash the combined result
 UNION ALL: Keeps duplicates → faster and cheaper, because rows are simply appended
 Prefer UNION ALL unless your business case requires deduplication
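
To make the difference in results concrete, here is a minimal sketch using literal rows instead of real tables. The city value is made up for illustration. Note that some engines, BigQuery among them, expect the keyword UNION DISTINCT rather than a bare UNION.

```sql
-- UNION removes the duplicate row: the result has one row ('Pune').
SELECT 'Pune' AS city
UNION
SELECT 'Pune' AS city;

-- UNION ALL keeps every input row: the result has two rows ('Pune', 'Pune').
SELECT 'Pune' AS city
UNION ALL
SELECT 'Pune' AS city;
```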

Why it matters:

Both DISTINCT and UNION deduplicate their output, typically through a sort or a hash aggregation. These are expensive operations, especially with millions of rows. When the deduplication is not actually needed, they cause unnecessary CPU usage, memory pressure, and longer runtimes in large-scale data systems (e.g., Azure Synapse, BigQuery, Spark).
Optimization Tips:

Use UNION ALL instead of UNION when duplicates don't affect results.
Avoid DISTINCT unless you're solving a real business need.
Investigate duplicates first; don't assume they're there.
Check execution plans to confirm whether a sort/shuffle is involved.
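
The last two tips can be sketched as follows, assuming a customers table with a city column plus the customers_2023 and customers_2024 tables used in the examples. The EXPLAIN prefix is accepted by many engines (PostgreSQL, MySQL, Spark SQL, Snowflake), but the exact keyword and the plan output vary, so treat this as a starting point rather than a portable recipe.

```sql
-- Investigate duplicates first: list only the values that occur more than once.
-- An empty result means DISTINCT would not change the output.
SELECT city, COUNT(*) AS occurrences
FROM customers
GROUP BY city
HAVING COUNT(*) > 1;

-- Check the plan: look for sort, hash-aggregate, or shuffle/exchange steps.
-- (The EXPLAIN syntax is engine-specific.)
EXPLAIN
SELECT city FROM customers_2023
UNION
SELECT city FROM customers_2024;
```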
Examples:

Slower query (deduplicates the result):
SELECT DISTINCT city FROM customers;

✅ Faster (only equivalent when duplicates cannot occur or are acceptable):
SELECT city FROM customers;

Union with deduplication:
SELECT city FROM customers_2023
UNION
SELECT city FROM customers_2024;

✅ Union All - better performance:
SELECT city FROM customers_2023
UNION ALL
SELECT city FROM customers_2024;
Best Practice:

 ✅ Prefer UNION ALL over UNION when duplicates do not impact business logic; it skips the deduplication step and is usually noticeably faster on large inputs.

 ❌ Avoid using UNION by default; it performs an implicit deduplication (a sort or hash aggregation), which is resource-intensive.

 ✅ If you must remove duplicates, consider whether upstream data cleansing or filtering can handle it instead.

 ✅ Use EXISTS, JOIN, or ROW_NUMBER() with a filter on the row number for more controlled deduplication when needed.

 Always check the query plan; UNION often triggers sort or shuffle operations that slow down performance, especially in Spark or distributed SQL engines.
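
As an illustration of controlled deduplication with ROW_NUMBER(), the sketch below combines the two yearly tables with UNION ALL and then keeps one row per city. The source_year column is a hypothetical tag added here to decide which copy wins; it is not part of the original tables. This still requires the engine to partition and order rows, so the benefit is control over which row is kept, not a guaranteed speed-up.

```sql
WITH combined AS (
    -- Append both inputs without deduplication, tagging each row with its origin.
    SELECT city, 2023 AS source_year FROM customers_2023
    UNION ALL
    SELECT city, 2024 AS source_year FROM customers_2024
),
ranked AS (
    -- Number the copies of each city, most recent year first.
    SELECT city,
           source_year,
           ROW_NUMBER() OVER (PARTITION BY city ORDER BY source_year DESC) AS rn
    FROM combined
)
-- Keep exactly one row per city: the copy from the latest year.
SELECT city, source_year
FROM ranked
WHERE rn = 1;
```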
Real-World Example:

You're combining order data from two e-commerce platforms:

 orders_amazon → 5 million rows
 orders_ebay → 4 million rows

❌ Inefficient Query:
SELECT order_id, customer_id FROM orders_amazon
UNION
SELECT order_id, customer_id FROM orders_ebay;

Result: Full deduplication via sort → high memory usage, query runs in ~30 seconds on large clusters

✅ Optimized Query:
SELECT order_id, customer_id FROM orders_amazon
UNION ALL
SELECT order_id, customer_id FROM orders_ebay;
Result: No sort → query runs in ~6 seconds
This improves scalability and reduces cost in cloud environments like BigQuery, Snowflake, or Azure Synapse, provided duplicate rows are acceptable or known not to occur.
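
The two queries above return the same rows only if no (order_id, customer_id) pair is repeated within or across the two tables. A hedged way to confirm that before switching to UNION ALL, using only the columns shown in the example, might look like this:

```sql
-- Duplicates inside one table (repeat the same check for orders_ebay).
SELECT order_id, customer_id, COUNT(*) AS copies
FROM orders_amazon
GROUP BY order_id, customer_id
HAVING COUNT(*) > 1;

-- Rows that appear in both tables.
SELECT COUNT(*) AS overlapping_rows
FROM orders_amazon a
WHERE EXISTS (
    SELECT 1
    FROM orders_ebay e
    WHERE e.order_id = a.order_id
      AND e.customer_id = a.customer_id
);
```

If the first query returns no rows for either table and overlapping_rows is 0, UNION and UNION ALL produce identical results and the cheaper form is safe.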
