
PWC DATA ANALYST EXPERIENCE (1-3 yoe)

Guesstimate Questions:
1. Estimate the number of smartphones sold in India annually.

To guesstimate the annual number of smartphones sold in India, we can break the problem
into logical steps using assumptions and available population data. Here's a structured
approach:

Step 1: Population of India

India's population is approximately 1.4 billion people.

Step 2: Target Population (Smartphone Users)

Not everyone in India uses or purchases a smartphone. Let's segment the population:

• Assume 70% of the population is in the age group 15-60, which is the primary
smartphone user base: 1.4 billion × 70% = 980 million.

• Assume 70% of this group can afford a smartphone or actively use one:
980 million × 70% = 686 million.

Step 3: Replacement Cycle and New Users

• On average, a smartphone is replaced every 2-3 years. This implies that
approximately one-third of smartphone users replace their phones annually:
686 million ÷ 3 ≈ 229 million.

• Assume an additional 5-10% of the target population (new users) buys a
smartphone each year. Using 10%: 686 million × 10% ≈ 68.6 million.

Step 4: Total Annual Sales

Adding the replacements and new users:


229 million + 68.6 million ≈ 300 million

Final Estimate:

Approximately 300 million smartphones are sold annually in India.

Assumptions Recap:
1. 70% of the population is in the primary age group for smartphone users.

2. 70% of this segment can afford smartphones.

3. Replacement cycle is 2-3 years.

4. Annual new users account for ~10% of the total target population.

Refinement:

This estimate can be refined further with real data on:

• Smartphone penetration rates.

• Market growth rates.

• Urban vs. rural demographics.

2. How would you estimate the daily revenue generated by roadside tea stalls across
India?

To estimate the daily revenue generated by roadside tea stalls across India, let's approach
this systematically using assumptions and logical reasoning:

Step 1: Population of India

India's population is approximately 1.4 billion.

Step 2: Target Population for Tea Consumption

Not everyone consumes tea from roadside stalls. Let’s segment the population:

• Assume 70% of the population (adults and older teens) regularly drink tea:
1.4 billion × 70% = 980 million.

• Out of these, assume 50% of tea drinkers consume tea from roadside stalls (the rest
may make tea at home, go to cafes, or use other sources): 980 million × 50% = 490 million.

Step 3: Daily Tea Consumption


• On average, tea drinkers consume 2 cups of tea daily.

• Not all cups are purchased from roadside stalls; assume 1 cup per person per day
is bought from such stalls: 490 million cups per day.

Step 4: Price of Tea

• The average price of tea at roadside stalls is approximately ₹10 per cup.

Step 5: Daily Revenue

• Multiply the daily consumption by the price per cup:


490 million cups × ₹10 = ₹4.9 billion

Final Estimate:

The daily revenue generated by roadside tea stalls across India is approximately ₹4.9
billion.

Assumptions Recap:

1. 70% of the population drinks tea.

2. 50% of tea drinkers buy from roadside stalls.

3. One cup per person is consumed daily at roadside stalls.

4. Average price per cup is ₹10.

Refinement:

To improve this estimate:

• Factor in rural vs. urban consumption patterns (higher urban roadside tea stall
density).

• Adjust for regional variations in tea prices and consumption habits.

• Account for occasional tea drinkers or seasonal demand changes.


Python Questions:
1. Write a Python function to find all unique pairs of integers in a list that sum up to a given
target value.

Find All Unique Pairs That Sum to a Target

def find_pairs(nums, target):
    seen = set()
    pairs = set()
    for num in nums:
        complement = target - num
        if complement in seen:
            # Store each pair in sorted order so duplicates collapse in the set
            pairs.add((min(num, complement), max(num, complement)))
        seen.add(num)
    return list(pairs)

# Example usage:
nums = [2, 4, 3, 7, 5, 8, -1]
target = 7
print(find_pairs(nums, target))  # Output (order may vary): [(3, 4), (2, 5), (-1, 8)]

2. Given a string, write a function to check if it’s a palindrome, ignoring spaces,


punctuation, and case sensitivity.

Check if a String Is a Palindrome

def is_palindrome(s):
    # Keep only alphanumeric characters and convert to lowercase
    filtered = ''.join(c for c in s if c.isalnum()).lower()
    return filtered == filtered[::-1]

# Example usage:
s = "A man, a plan, a canal, Panama!"
print(is_palindrome(s))  # Output: True

3. Explain the difference between deep copy and shallow copy in Python. When would you
use each?

Deep Copy vs. Shallow Copy

• Shallow Copy:

o Creates a new object but does not create copies of nested objects.

o Changes to mutable objects within the original will reflect in the copied
object.

o Example: Using copy.copy() or the copy() method of a list.

• Deep Copy:

o Creates a new object along with copies of all objects it contains, recursively.

o Changes to the original object do not affect the copied object.

o Example: Using copy.deepcopy().

Example:

import copy

original = [[1, 2], [3, 4]]

shallow = copy.copy(original)
deep = copy.deepcopy(original)

original[0][0] = 99

print(shallow) # Output: [[99, 2], [3, 4]]

print(deep) # Output: [[1, 2], [3, 4]]

Use Cases:

• Use shallow copy when you want to duplicate a structure but allow shared mutable
data.

• Use deep copy when creating a fully independent copy is necessary.

4. What are decorators in Python, and how do they work? Provide an example of a scenario
where a decorator would be useful.

Decorators in Python

Decorators are functions that modify the behavior of other functions or methods. They take
a function as input, wrap it with additional functionality, and return the new wrapped function.

Example of a Decorator:

def logger(func):
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__} with {args} and {kwargs}")
        result = func(*args, **kwargs)
        print(f"{func.__name__} returned {result}")
        return result
    return wrapper

@logger
def add(a, b):
    return a + b

# Example usage:
print(add(3, 5))

Output:

Calling add with (3, 5) and {}

add returned 8

8

When to Use:

Decorators are useful for:

1. Logging: Automatically log function calls.

2. Authentication: Check user permissions before executing a function.

3. Caching: Store results of expensive computations for reuse.

SQL Questions:
1. Write a query to find the cumulative revenue by month for each product category in
a sales table.

Step 1: Create the Sales Table

CREATE TABLE sales (

id INT AUTO_INCREMENT PRIMARY KEY,

product_category VARCHAR(50),
revenue DECIMAL(10, 2),

sale_date DATE

);

Step 2: Insert Sample Records

INSERT INTO sales (product_category, revenue, sale_date) VALUES

('Electronics', 5000.00, '2024-01-15'),

('Electronics', 7000.00, '2024-02-10'),

('Electronics', 4000.00, '2024-03-05'),

('Clothing', 2000.00, '2024-01-20'),

('Clothing', 3000.00, '2024-02-15'),

('Clothing', 1500.00, '2024-03-01'),

('Groceries', 1000.00, '2024-01-10'),

('Groceries', 1200.00, '2024-02-12'),

('Groceries', 1300.00, '2024-03-08');

Step 3: Write the Query for Cumulative Revenue

SELECT

product_category,

DATE_FORMAT(sale_date, '%Y-%m') AS month,

SUM(revenue) AS monthly_revenue,

SUM(SUM(revenue)) OVER (PARTITION BY product_category ORDER BY


DATE_FORMAT(sale_date, '%Y-%m')) AS cumulative_revenue

FROM

sales

GROUP BY
product_category, DATE_FORMAT(sale_date, '%Y-%m')

ORDER BY

product_category, month;

Explanation:

1. DATE_FORMAT(sale_date, '%Y-%m'): Extracts the year and month from the
sale_date for grouping.

2. SUM(SUM(revenue)) OVER (PARTITION BY product_category ORDER BY
DATE_FORMAT(sale_date, '%Y-%m')): Calculates the cumulative revenue for each
product category by summing the monthly revenues in the specified order.

3. GROUP BY product_category, DATE_FORMAT(sale_date, '%Y-%m'): Groups the
data by product category and month.

Sample Output:

Product_Category Month Monthly_Revenue Cumulative_Revenue

Electronics 2024-01 5000.00 5000.00

Electronics 2024-02 7000.00 12000.00

Electronics 2024-03 4000.00 16000.00

Clothing 2024-01 2000.00 2000.00

Clothing 2024-02 3000.00 5000.00

Clothing 2024-03 1500.00 6500.00

Groceries 2024-01 1000.00 1000.00

Groceries 2024-02 1200.00 2200.00

Groceries 2024-03 1300.00 3500.00
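
For reference, the same result can be produced by aggregating each month in a CTE first and then applying a plain running-total window. This is a sketch assuming MySQL 8.0+ (CTEs and window functions), equivalent to the query above:

WITH monthly AS (
    SELECT
        product_category,
        DATE_FORMAT(sale_date, '%Y-%m') AS month,
        SUM(revenue) AS monthly_revenue
    FROM sales
    GROUP BY product_category, DATE_FORMAT(sale_date, '%Y-%m')
)
SELECT
    product_category,
    month,
    monthly_revenue,
    -- Running total per category, ordered by month
    SUM(monthly_revenue) OVER (
        PARTITION BY product_category
        ORDER BY month
    ) AS cumulative_revenue
FROM monthly
ORDER BY product_category, month;

Some readers find this two-step form easier to follow than nesting SUM(SUM(revenue)) inside a window function, although both produce the same output.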


2. How would you retrieve the top 5 products by sales volume, excluding any products that
had zero sales in the past 3 months?

Step 1: Create the Products Table

CREATE TABLE product_sales (

product_id INT AUTO_INCREMENT PRIMARY KEY,

product_name VARCHAR(50),

sales_volume INT,

sale_date DATE

);

Step 2: Insert Sample Records

INSERT INTO product_sales (product_name, sales_volume, sale_date) VALUES

('Product A', 150, '2024-10-01'),

('Product A', 200, '2024-11-01'),

('Product A', 180, '2024-12-01'),

('Product B', 100, '2024-10-01'),

('Product B', 0, '2024-11-01'),

('Product B', 50, '2024-12-01'),

('Product C', 250, '2024-10-15'),

('Product C', 300, '2024-11-15'),

('Product C', 400, '2024-12-15'),

('Product D', 0, '2024-10-10'),

('Product D', 0, '2024-11-10'),

('Product D', 0, '2024-12-10'),

('Product E', 500, '2024-10-05'),

('Product E', 600, '2024-11-05'),


('Product E', 700, '2024-12-05');

Step 3: Write the Query

WITH recent_sales AS (

SELECT

product_name,

SUM(sales_volume) AS total_sales,

MAX(CASE WHEN sale_date >= CURDATE() - INTERVAL 3 MONTH THEN sales_volume


ELSE 0 END) AS recent_sales_flag

FROM

product_sales

WHERE

sale_date >= CURDATE() - INTERVAL 3 MONTH

GROUP BY

product_name

),

valid_products AS (

SELECT

product_name,

total_sales

FROM

recent_sales

WHERE

recent_sales_flag > 0

)

SELECT
product_name,

total_sales

FROM

valid_products

ORDER BY

total_sales DESC

LIMIT 5;

Explanation:

1. recent_sales CTE:

o Calculates the total sales for each product.

o Uses CASE to flag whether a product had non-zero sales in the past 3
months.

2. valid_products CTE:

o Filters out products with zero sales in all the past 3 months using
recent_sales_flag > 0.

3. Final Query:

o Retrieves the top 5 products by total sales from valid_products.

o Orders the results in descending order of total sales and limits the output to
the top 5 products.

Expected Output:

Product_Name Total_Sales

Product E 1800

Product C 950

Product A 530

Product B 150
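
A more compact formulation, sketched here under the assumption that sales_volume is never negative, filters directly in HAVING and produces the same top-5 list:

SELECT
    product_name,
    SUM(sales_volume) AS total_sales
FROM product_sales
-- Keep only rows from the past 3 months
WHERE sale_date >= CURDATE() - INTERVAL 3 MONTH
GROUP BY product_name
-- Exclude products whose total sales over that window are zero
HAVING SUM(sales_volume) > 0
ORDER BY total_sales DESC
LIMIT 5;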

3. Given a table of customer transactions, identify all customers who made purchases in
two or more consecutive months.

To solve this, we'll assume the following table structure for customer transactions:

Step 1: Create the Transactions Table

CREATE TABLE customer_transactions (

transaction_id INT AUTO_INCREMENT PRIMARY KEY,

customer_id INT,

transaction_date DATE,

amount DECIMAL(10, 2)

);

Step 2: Insert Sample Records

INSERT INTO customer_transactions (customer_id, transaction_date, amount) VALUES

(1, '2024-01-15', 100.00),

(1, '2024-02-10', 200.00),

(1, '2024-04-05', 150.00),

(2, '2024-01-20', 300.00),

(2, '2024-02-15', 400.00),

(2, '2024-03-01', 500.00),

(3, '2024-01-25', 250.00),

(3, '2024-03-10', 300.00),

(4, '2024-02-05', 150.00),

(4, '2024-03-07', 200.00),


(4, '2024-04-15', 250.00);

Step 3: Write the Query

WITH monthly_transactions AS (

SELECT

customer_id,

DATE_FORMAT(transaction_date, '%Y-%m-01') AS transaction_month

FROM

customer_transactions

GROUP BY

customer_id, DATE_FORMAT(transaction_date, '%Y-%m-01')

),

consecutive_months AS (

SELECT

t1.customer_id,

t1.transaction_month AS month1,

t2.transaction_month AS month2

FROM

monthly_transactions t1

JOIN

monthly_transactions t2

ON

t1.customer_id = t2.customer_id

AND DATE_ADD(t1.transaction_month, INTERVAL 1 MONTH) = t2.transaction_month

)
SELECT DISTINCT

customer_id

FROM

consecutive_months;

Explanation:

1. monthly_transactions CTE:

o Groups transactions by customer and month using
DATE_FORMAT(transaction_date, '%Y-%m-01'), which represents each month by its
first day.

o Ensures we have a unique list of months in which a customer made


purchases.

2. consecutive_months CTE:

o Joins monthly_transactions with itself to find customers with consecutive


months.

o Uses DATE_ADD(t1.transaction_month, INTERVAL 1 MONTH) to calculate the
first day of the next month and checks if it matches t2.transaction_month.

3. Final Query:

o Selects unique customer IDs from the consecutive_months CTE.

Sample Output:

Customer_ID

1

2

4

Notes:
• Customer 1: Purchased in January and February.

• Customer 2: Purchased in January, February, and March.

• Customer 4: Purchased in February, March, and April.

• Customer 3: Skipped February, so they are not included in the output.
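
An alternative sketch, assuming MySQL 8.0+ for the LAG() window function, finds the same customers without a self-join:

WITH monthly_transactions AS (
    SELECT DISTINCT
        customer_id,
        DATE_FORMAT(transaction_date, '%Y-%m-01') AS transaction_month
    FROM customer_transactions
),
with_previous AS (
    SELECT
        customer_id,
        transaction_month,
        -- Previous month (if any) in which the same customer purchased
        LAG(transaction_month) OVER (
            PARTITION BY customer_id
            ORDER BY transaction_month
        ) AS previous_month
    FROM monthly_transactions
)
SELECT DISTINCT customer_id
FROM with_previous
WHERE previous_month = DATE_SUB(transaction_month, INTERVAL 1 MONTH);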

4. Write a query to calculate the retention rate of users on a monthly basis.

Retention Rate Definition

The retention rate is the percentage of users who return in a subsequent month after their
initial activity.

Assumptions

• We have a table called user_activity with the following structure:

o user_id: Unique identifier for each user.

o activity_date: Date of the user's activity.

Step 1: Create the Table

CREATE TABLE user_activity (

user_id INT,

activity_date DATE

);

Step 2: Insert Sample Records

INSERT INTO user_activity (user_id, activity_date) VALUES

(1, '2024-01-15'),

(1, '2024-02-10'),

(1, '2024-03-20'),
(2, '2024-01-20'),

(2, '2024-02-15'),

(3, '2024-02-05'),

(3, '2024-03-10'),

(4, '2024-01-25'),

(5, '2024-02-18'),

(5, '2024-03-15'),

(6, '2024-03-01');

Step 3: Query to Calculate Retention Rate

WITH first_month_activity AS (

SELECT

user_id,

DATE_FORMAT(MIN(activity_date), '%Y-%m') AS first_active_month

FROM

user_activity

GROUP BY

user_id

),

monthly_retention AS (

SELECT

fma.first_active_month,

DATE_FORMAT(ua.activity_date, '%Y-%m') AS active_month,

COUNT(DISTINCT ua.user_id) AS retained_users

FROM

user_activity ua
JOIN

first_month_activity fma

ON

ua.user_id = fma.user_id

GROUP BY

fma.first_active_month, DATE_FORMAT(ua.activity_date, '%Y-%m')

),

monthly_cohort AS (

SELECT

first_active_month,

COUNT(DISTINCT user_id) AS cohort_size

FROM

first_month_activity

GROUP BY

first_active_month

)

SELECT

mr.first_active_month,

mr.active_month,

mr.retained_users,

mc.cohort_size,

ROUND((mr.retained_users / mc.cohort_size) * 100, 2) AS retention_rate

FROM

monthly_retention mr

JOIN

monthly_cohort mc
ON

mr.first_active_month = mc.first_active_month

ORDER BY

mr.first_active_month, mr.active_month;

Explanation

1. first_month_activity CTE:

o Determines the first active month for each user.

2. monthly_retention CTE:

o Counts the number of users retained for each combination of their first
active month and subsequent activity months.

3. monthly_cohort CTE:

o Calculates the size of the cohort for each first active month (the total number
of users who first became active in that month).

4. Final Query:

o Joins monthly_retention and monthly_cohort to calculate the retention rate
as: Retention Rate = (Retained Users ÷ Cohort Size) × 100.

o Orders the results by the first active month and the active month.

Sample Output

First_Active_Month Active_Month Retained_Users Cohort_Size Retention_Rate

2024-01 2024-01 3 3 100.00

2024-01 2024-02 2 3 66.67

2024-01 2024-03 1 3 33.33

2024-02 2024-02 2 2 100.00

2024-02 2024-03 2 2 100.00

2024-03 2024-03 1 1 100.00

Interpretation

• First Active Month: The cohort of users who became active in that month.

• Active Month: The months in which users returned.

• Retention Rate: The percentage of the cohort that returned in subsequent months.

5. Find the nth highest salary from an employee table, where n is a parameter passed
dynamically to the query.

To find the nth highest salary dynamically, we can order the distinct salaries in descending
order and use the LIMIT clause to skip the first n−1 of them before retrieving the nth one.
Because MySQL does not allow variables or expressions directly in LIMIT, the statement is
wrapped in a prepared statement so the offset can be supplied dynamically. Here's how:

Table Creation and Sample Data

CREATE TABLE employees (

employee_id INT PRIMARY KEY,

employee_name VARCHAR(50),

salary DECIMAL(10, 2)

);

INSERT INTO employees (employee_id, employee_name, salary) VALUES

(1, 'Alice', 60000.00),

(2, 'Bob', 75000.00),


(3, 'Charlie', 85000.00),

(4, 'David', 50000.00),

(5, 'Eve', 85000.00);

Query for nth Highest Salary

SET @offset := 1; -- offset = n - 1; here n = 2

PREPARE nth_salary_stmt FROM
    'SELECT DISTINCT salary
     FROM employees
     ORDER BY salary DESC
     LIMIT ?, 1';

EXECUTE nth_salary_stmt USING @offset;

DEALLOCATE PREPARE nth_salary_stmt;

Explanation

1. @offset Variable:

o Holds n − 1, the number of higher distinct salaries to skip; set it dynamically
before executing the statement.

2. DISTINCT salary:

o Ensures unique salaries are considered in case of duplicates.

3. ORDER BY salary DESC:

o Orders salaries in descending order, ranking the highest salary first.

4. LIMIT ?, 1 (via PREPARE/EXECUTE):

o Skips the top n − 1 salaries and retrieves the next one; the prepared statement
is required because LIMIT does not accept variables or expressions directly.

Alternative Query Using Window Functions (MySQL 8.0+)

If the database supports window functions, you can use the DENSE_RANK() function:

SET @n := 2; -- Set the value of n dynamically


WITH ranked_salaries AS (

SELECT

salary,

DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank

FROM

employees

)

SELECT salary

FROM ranked_salaries

WHERE salary_rank = @n;

Explanation (Window Functions)

1. DENSE_RANK():

o Assigns a rank to each salary in descending order; duplicate salaries
get the same rank. The alias salary_rank is used because RANK is a reserved
word in MySQL 8.0.

2. WITH ranked_salaries:

o Creates a temporary table with salaries and their respective ranks.

3. WHERE salary_rank = @n:

o Filters the result to return only the nth-ranked salary.

Sample Output

For n = 2:

Salary

75000.00

Key Notes
• Use the DISTINCT keyword to handle duplicate salaries for the LIMIT method.

• Use DENSE_RANK() if you want to consider duplicate salaries as a single rank.
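
If n has to be passed in as a true parameter (for example from application code), a stored procedure is a common wrapper. The following is only a sketch; the procedure name get_nth_salary is illustrative, and it relies on the fact that LIMIT accepts routine parameters and local variables inside stored programs:

DELIMITER //

CREATE PROCEDURE get_nth_salary(IN n INT)
BEGIN
    DECLARE offset_rows INT;
    SET offset_rows = n - 1;  -- Skip the n - 1 higher distinct salaries
    SELECT DISTINCT salary
    FROM employees
    ORDER BY salary DESC
    LIMIT offset_rows, 1;
END //

DELIMITER ;

CALL get_nth_salary(2);  -- Returns 75000.00 for the sample data above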

6. Explain how indexing works in SQL and how to decide which columns should be indexed
for optimal performance.

How Indexing Works in SQL

An index is a database structure that improves the speed of data retrieval operations on a
table. It works like an optimized lookup table for the database, allowing it to quickly locate
rows without scanning the entire table.

• Structure: Most indexes are implemented as balanced tree structures (e.g., B-trees)
or hash tables. These structures allow efficient searching, insertion, and deletion
operations.

• Function: When a query is executed, the database engine checks if an index is


available for the columns involved in the query’s filters or joins. If so, the engine
uses the index to locate the rows, reducing the need for a full table scan.

Types of Indexes

1. Primary Index:

o Automatically created for the primary key column.

o Ensures unique values and quick lookups for primary key operations.

2. Unique Index:

o Ensures that all values in the indexed column are unique.

3. Clustered Index:

o Reorders the physical storage of table data to match the index order.

o A table can have only one clustered index.

4. Non-clustered Index:

o Creates a separate structure to store the index and points to the table rows.

o A table can have multiple non-clustered indexes.


5. Composite Index:

o Indexes multiple columns together.

6. Full-Text Index:

o Optimized for searching text data, such as finding words or phrases in large
text fields.

Benefits of Indexing

• Faster Query Execution: Speeds up SELECT, JOIN, and WHERE clause operations.

• Reduced I/O Operations: Fewer rows are read from the disk.

• Sorted Data Retrieval: Helps with ORDER BY and GROUP BY clauses.

Drawbacks of Indexing

• Slower Write Operations: INSERT, UPDATE, and DELETE operations become slower
because the index must also be updated.

• Storage Overhead: Indexes consume additional disk space.

• Maintenance Overhead: Indexes need to be maintained, especially in tables with


frequent data modifications.

How to Decide Which Columns to Index

1. Frequently Queried Columns:

o Index columns that appear frequently in WHERE, JOIN, ON, ORDER BY, or
GROUP BY clauses.

2. Primary Keys and Unique Constraints:

o Always index primary key columns as they uniquely identify rows.

3. Foreign Keys:

o Index foreign key columns to improve JOIN performance.

4. High-Selectivity Columns:
o Choose columns with a wide range of unique values (e.g., a user_id column)
because indexes work best with high selectivity.

5. Composite Indexes:

o Use composite indexes when multiple columns are often queried together.
For example, for queries like:

SELECT * FROM sales WHERE year = 2023 AND region = 'North';

A composite index on (year, region) will perform better than individual indexes.

6. Avoid Low-Selectivity Columns:

o Avoid indexing columns with few distinct values (e.g., gender or status with
values like 'Active' or 'Inactive').

7. Read-Heavy Tables:

o Index columns in tables where SELECT operations are more frequent than
INSERT/UPDATE/DELETE.

Examples

Scenario 1: Searching by email in a user table

CREATE INDEX idx_email ON users(email);

• Improves performance for queries like:

SELECT * FROM users WHERE email = 'user@example.com';

Scenario 2: Composite index for a sales table

CREATE INDEX idx_year_region ON sales(year, region);

• Optimizes queries with:

SELECT * FROM sales WHERE year = 2023 AND region = 'North';

Scenario 3: Indexing a foreign key

CREATE INDEX idx_customer_id ON orders(customer_id);

• Speeds up JOINs like:

SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id;


Monitoring and Tuning

1. EXPLAIN Plan:

o Use EXPLAIN to analyze how the database executes a query and whether it
uses an index (see the example after this list).

2. Query Performance Metrics:

o Monitor slow queries and identify columns for potential indexing.

3. Index Maintenance:

o Periodically rebuild or reorganize indexes to ensure they remain efficient.
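
As an illustration of point 1, the index from Scenario 1 can be checked with EXPLAIN (the email value is just a placeholder):

EXPLAIN SELECT * FROM users WHERE email = 'user@example.com';

In the output, the key column should show idx_email and the type column should show ref; a type of ALL would indicate a full table scan instead.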

Summary

• Use indexes on frequently queried, high-selectivity columns.

• Avoid excessive indexing on write-heavy tables.

• Analyze query patterns and use tools like EXPLAIN to make data-driven decisions
about indexing.

7. Describe the differences between LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN and
when to use each one in a complex query.

Differences Between LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN

In SQL, JOIN operations combine rows from two or more tables based on a related column.
The differences among LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN lie in how
unmatched rows are handled.

1. LEFT JOIN

• Definition: Returns all rows from the left table and the matched rows from the right
table. If no match is found, the result contains NULL for columns from the right
table.

• Use Case: Use when you want all records from the left table regardless of whether
there is a match in the right table.
Syntax

SELECT columns

FROM table1

LEFT JOIN table2

ON table1.common_column = table2.common_column;

Example

• Tables:

o Customers:

CustomerID Name

1 Alice

2 Bob

3 Charlie

o Orders:

OrderID CustomerID

101 1

102 2

• Query:

SELECT c.Name, o.OrderID

FROM Customers c

LEFT JOIN Orders o

ON c.CustomerID = o.CustomerID;

• Result:

Name OrderID

Alice 101

Bob 102

Charlie NULL

2. RIGHT JOIN

• Definition: Returns all rows from the right table and the matched rows from the left
table. If no match is found, the result contains NULL for columns from the left table.

• Use Case: Use when you want all records from the right table regardless of whether
there is a match in the left table.

Syntax

SELECT columns

FROM table1

RIGHT JOIN table2

ON table1.common_column = table2.common_column;

Example

• Query:

SELECT c.Name, o.OrderID

FROM Customers c

RIGHT JOIN Orders o

ON c.CustomerID = o.CustomerID;

• Result:

Name OrderID

Alice 101

Bob 102
3. FULL OUTER JOIN

• Definition: Combines the results of LEFT JOIN and RIGHT JOIN. Returns all rows
from both tables, with NULL in columns where no match exists.

• Use Case: Use when you want to include all records from both tables, showing
unmatched rows with NULL values.

Syntax

SELECT columns

FROM table1

FULL OUTER JOIN table2

ON table1.common_column = table2.common_column;

Example

• Query:

SELECT c.Name, o.OrderID

FROM Customers c

FULL OUTER JOIN Orders o

ON c.CustomerID = o.CustomerID;

• Result:

Name OrderID

Alice 101

Bob 102

Charlie NULL
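
Note: MySQL, the dialect used in the earlier examples, does not support FULL OUTER JOIN natively. A common workaround, sketched below, is to UNION a LEFT JOIN with a RIGHT JOIN (UNION also removes the duplicated matched rows):

SELECT c.Name, o.OrderID
FROM Customers c
LEFT JOIN Orders o ON c.CustomerID = o.CustomerID

UNION

SELECT c.Name, o.OrderID
FROM Customers c
RIGHT JOIN Orders o ON c.CustomerID = o.CustomerID;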

When to Use Each Join in Complex Queries

1. LEFT JOIN:

o When the left table contains a primary set of data and you want to include all
rows, even if they have no matching data in the right table.
o Example: Listing all customers, including those who haven't made any
orders.

2. RIGHT JOIN:

o When the right table contains a primary set of data and you want to include
all rows, even if they have no matching data in the left table.

o Example: Listing all orders, including those made by unregistered


customers.

3. FULL OUTER JOIN:

o When both tables are equally important, and you want to analyze all data
points, even unmatched rows.

o Example: Creating a comprehensive report that includes all customers and


all orders, showing unmatched customers or orders.

Key Differences in a Nutshell

Feature LEFT JOIN RIGHT JOIN FULL OUTER JOIN

Rows from Left Table Always Included Only if Matched Always Included

Rows from Right Table Only if Matched Always Included Always Included

Unmatched Rows NULL in Right Columns NULL in Left Columns NULL in Both Columns

Visual Representation

If A represents rows from the left table and B represents rows from the right table:

• LEFT JOIN: A ∪ (A ∩ B)

• RIGHT JOIN: B ∪ (A ∩ B)

• FULL OUTER JOIN: A ∪ B


Performance Tips

• Use LEFT JOIN or RIGHT JOIN instead of FULL OUTER JOIN if you only need one
side's unmatched rows, as it reduces computation.

• Always use indexes on the columns used in the ON clause to improve performance
in joins.

8. What is the difference between HAVING and WHERE clauses in SQL, and when would
you use each?

Difference Between HAVING and WHERE Clauses in SQL

1. WHERE Clause

• Purpose: Filters rows before any aggregation.

• Scope: Applied before any GROUP BY operation.

• Use Case: Used to filter rows based on conditions applied to individual columns.

Syntax:

SELECT column1, column2, ...

FROM table_name

WHERE condition;

Example:

SELECT customer_name, total_orders

FROM orders

WHERE total_orders > 50;

• Explanation: Filters out rows before aggregation (i.e., filters orders with total_orders
> 50).

2. HAVING Clause

• Purpose: Filters the aggregated results (after applying GROUP BY).

• Scope: Applied after the GROUP BY operation.


• Use Case: Used to filter aggregated data based on conditions applied to aggregate
functions like SUM, AVG, COUNT, etc.

Syntax:

SELECT column1, aggregate_function(column2)

FROM table_name

GROUP BY column1

HAVING condition;

Example:

SELECT category, COUNT(order_id) AS total_orders

FROM orders

GROUP BY category

HAVING total_orders > 10;

• Explanation: Filters the aggregated results (only categories with total_orders > 10
are included).

Key Differences

Feature WHERE Clause HAVING Clause

Purpose Filters rows before aggregation Filters aggregated results

Scope Applied to the table rows individually Applied to the grouped results

Usage Used to filter individual rows Used to filter aggregated results

Conditions Can apply conditions to non-aggregated columns (before grouping) Can apply conditions to aggregated columns (after grouping)

Example WHERE total_orders > 50 HAVING COUNT(order_id) > 10

When to Use Each

1. Use WHERE when:


o You need to filter rows based on conditions before performing any
aggregation.

o Example: Filtering customer records where the order count is more than 50.

2. Use HAVING when:

o You need to filter the results of an aggregation.

o Example: Counting orders by category and filtering categories with more than
10 orders.

Practical Scenario

-- Example using both WHERE and HAVING

SELECT category, COUNT(order_id) AS total_orders

FROM orders

WHERE order_date >= '2024-01-01' -- Filtering based on date before aggregation

GROUP BY category

HAVING total_orders > 10; -- Filtering aggregated results

This will show categories with more than 10 orders placed after January 1, 2024.
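
One portability note: referencing the alias total_orders inside HAVING works in MySQL, but the SQL standard (and engines such as PostgreSQL and SQL Server) requires repeating the aggregate. A sketch of the more portable form of the same query:

SELECT category, COUNT(order_id) AS total_orders
FROM orders
WHERE order_date >= '2024-01-01'   -- Row-level filter before grouping
GROUP BY category
HAVING COUNT(order_id) > 10;       -- Aggregate filter, no alias used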
