PwC Data Analyst Interview
Guesstimate Questions:
1. Estimate the number of smartphones sold in India annually.
To guesstimate the annual number of smartphones sold in India, we can break the problem
into logical steps using assumptions and available population data. Here's a structured
approach:
Not everyone in India uses or purchases a smartphone. Let's segment the population:
• Assume 70% of the population is in the age group 15-60, which is the primary
smartphone user base.
1.4 billion × 70% = 980 million
• Assume 70% of this group can afford a smartphone or actively use one.
980 million × 70% = 686 million
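The two filters above can be chained in a few lines of Python. This is a minimal sketch that uses only the percentages stated above; the variable names are illustrative, and it stops at the target user base because it makes no assumption about how often people buy a phone.

```python
# Figures come from the assumptions above; the names are illustrative.
population = 1_400_000_000   # India's population, roughly 1.4 billion
age_share_pct = 70           # share of the population aged 15-60
affordability_pct = 70       # share of that group who can afford or use a smartphone

# Integer arithmetic avoids floating-point rounding noise.
age_group = population * age_share_pct // 100
target_users = age_group * affordability_pct // 100

print(f"Primary age group: {age_group / 1e6:.0f} million")
print(f"Target user base:  {target_users / 1e6:.0f} million")
```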
Final Estimate:
Assumptions Recap:
1. 70% of the population is in the primary age group for smartphone users.
4. Annual new users account for ~10% of the total target population.
Refinement:
2. How would you estimate the daily revenue generated by roadside tea stalls across
India?
To estimate the daily revenue generated by roadside tea stalls across India, let's approach
this systematically using assumptions and logical reasoning:
Not everyone consumes tea from roadside stalls. Let’s segment the population:
• Assume 70% of the population (adults and older teens) regularly drink tea.
1.4 billion × 70% = 980 million
• Out of these, assume 50% of tea drinkers consume tea from roadside stalls (the rest
may make tea at home, go to cafes, or other sources).
980 million × 50% = 490 million
• Not all cups are purchased from roadside stalls; assume 1 cup per person per day
is bought from such stalls.
490 million × 1 = 490 million cups per day
• The average price of tea at roadside stalls is approximately ₹10 per cup.
Final Estimate:
The daily revenue generated by roadside tea stalls across India is approximately ₹4.9
billion.
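The chain of assumptions behind this figure can be checked with a short script. This is a sketch that uses only the numbers stated above; the variable names are illustrative.

```python
# Figures come from the assumptions above; the names are illustrative.
population = 1_400_000_000   # roughly 1.4 billion
tea_drinker_pct = 70         # share who regularly drink tea
roadside_pct = 50            # share of tea drinkers who buy from roadside stalls
cups_per_person = 1          # cups bought from a stall per person per day
price_per_cup = 10           # rupees

tea_drinkers = population * tea_drinker_pct // 100
roadside_customers = tea_drinkers * roadside_pct // 100
cups_per_day = roadside_customers * cups_per_person
daily_revenue = cups_per_day * price_per_cup

print(f"Cups per day:  {cups_per_day / 1e6:.0f} million")
print(f"Daily revenue: ₹{daily_revenue / 1e9:.1f} billion")
```

Changing any single percentage shows how sensitive the final number is to that assumption, which is usually what an interviewer probes next.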
Assumptions Recap:
Refinement:
• Factor in rural vs. urban consumption patterns (higher urban roadside tea stall
density).
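def find_pairs(nums, target):  # function name assumed; the original definition is missing
    seen = set()
    pairs = set()
    for num in nums:
        complement = target - num
        if complement in seen:
            # Store each pair in sorted order so duplicates collapse
            pairs.add((min(num, complement), max(num, complement)))
        seen.add(num)
    return list(pairs)
# Example usage:
nums = [1, 2, 3, 4, 5, 6]  # sample input (assumed)
target = 7
print(find_pairs(nums, target))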
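import string
def is_palindrome(s):  # body reconstructed; assumed to ignore case, spaces, and punctuation
    cleaned = "".join(ch.lower() for ch in s if ch not in string.punctuation and not ch.isspace())
    return cleaned == cleaned[::-1]
# Example usage:
print(is_palindrome("A man, a plan, a canal: Panama"))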
3. Explain the difference between deep copy and shallow copy in Python. When would you
use each?
• Shallow Copy:
o Creates a new object but does not create copies of nested objects.
o Changes to nested mutable objects in the original are reflected in the copy,
because both share the same inner objects.
• Deep Copy:
o Creates a new object along with copies of all objects it contains, recursively.
Example:
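import copy
original = [[1, 2, 3], [4, 5, 6]]  # sample nested list (assumed; its definition is missing)
shallow = copy.copy(original)
deep = copy.deepcopy(original)
original[0][0] = 99
print(shallow[0][0])  # 99: the shallow copy shares the inner lists
print(deep[0][0])     # 1: the deep copy is fully independent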
Use Cases:
• Use shallow copy when you want to duplicate a structure but allow shared mutable
data.
4. What are decorators in Python, and how do they work? Provide an example of a scenario
where a decorator would be useful.
Decorators in Python
Decorators are functions that modify the behavior of other functions or methods. They take
a function as input, wrap it with extra functionality, and return the wrapped function in its place.
Example of a Decorator:
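def logger(func):
    # wrapper reconstructed: it calls the original function and logs its result
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        print(f"{func.__name__} returned {result}")
        return result
    return wrapper
@logger
def add(a, b):
    return a + b
# Example usage:
print(add(3, 5))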
Output:
add returned 8
8
When to Use:
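One common scenario is measuring how long a function takes without touching its body. The sketch below is illustrative rather than taken from the original answer: the names timed and total are hypothetical, and functools.wraps is used so the decorated function keeps its own name and docstring.

```python
import functools
import time

def timed(func):
    """Print how long each call to func takes."""
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.6f} s")
        return result
    return wrapper

@timed
def total(numbers):
    return sum(numbers)

value = total(range(1_000_000))
print(value)
```

The same pattern applies to logging, caching, retrying, and access checks: the cross-cutting behavior lives in one place and is applied with a single @ line.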
SQL Questions:
1. Write a query to find the cumulative revenue by month for each product category in
a sales table.
CREATE TABLE sales (
    product_category VARCHAR(50),
    revenue DECIMAL(10, 2),
    sale_date DATE
);
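SELECT
    product_category,
    DATE_FORMAT(sale_date, '%Y-%m') AS month,
    SUM(revenue) AS monthly_revenue,
    -- Running total per category (reconstructed; the alias cumulative_revenue is assumed)
    SUM(SUM(revenue)) OVER (
        PARTITION BY product_category
        ORDER BY DATE_FORMAT(sale_date, '%Y-%m')
    ) AS cumulative_revenue
FROM
    sales
GROUP BY
    product_category, DATE_FORMAT(sale_date, '%Y-%m')
ORDER BY
    product_category, month;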
Explanation:
Sample Output:
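To see what a cumulative total per category looks like, the same logic can be sketched in plain Python. The rows below are hypothetical and exist only for illustration; they are not the original sample output.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical rows: (product_category, sale_date, revenue)
sales = [
    ("Electronics", "2024-01-05", 100.0),
    ("Electronics", "2024-01-20", 50.0),
    ("Electronics", "2024-02-11", 200.0),
    ("Clothing", "2024-01-09", 80.0),
    ("Clothing", "2024-03-02", 20.0),
]

# Step 1: monthly revenue per category (the GROUP BY step).
monthly = defaultdict(float)
for category, sale_date, revenue in sales:
    monthly[(category, sale_date[:7])] += revenue

# Step 2: running total within each category, ordered by month.
cumulative = {}
for category in sorted({c for c, _ in monthly}):
    months = sorted(m for c, m in monthly if c == category)
    totals = accumulate(monthly[(category, m)] for m in months)
    for month, running in zip(months, totals):
        cumulative[(category, month)] = running

for (category, month), running in cumulative.items():
    print(category, month, monthly[(category, month)], running)
```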
CREATE TABLE product_sales (
    product_name VARCHAR(50),
    sales_volume INT,
    sale_date DATE
);
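WITH recent_sales AS (
    SELECT
        product_name,
        SUM(sales_volume) AS total_sales,
        -- Reconstructed (the original CASE expression is missing): counts rows with non-zero sales
        SUM(CASE WHEN sales_volume > 0 THEN 1 ELSE 0 END) AS recent_sales_flag
    FROM
        product_sales
    WHERE
        -- Reconstructed (the original condition is missing): keep only the past 3 months
        sale_date >= DATE_SUB(CURDATE(), INTERVAL 3 MONTH)
    GROUP BY
        product_name
),
valid_products AS (
    SELECT
        product_name,
        total_sales
    FROM
        recent_sales
    WHERE
        recent_sales_flag > 0
)
SELECT
    product_name,
    total_sales
FROM
    valid_products
ORDER BY
    total_sales DESC
LIMIT 5;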
Explanation:
1. recent_sales CTE:
o Uses CASE to flag whether a product had non-zero sales in the past 3
months.
2. valid_products CTE:
o Filters out products with zero sales in all the past 3 months using
recent_sales_flag > 0.
3. Final Query:
o Orders the results in descending order of total sales and limits the output to
the top 5 products.
Expected Output:
Product_Name Total_Sales
Product E 1800
Product C 950
Product A 530
Product B 150
3. Given a table of customer transactions, identify all customers who made purchases in
two or more consecutive months.
To solve this, we'll assume the following table structure for customer transactions:
CREATE TABLE customer_transactions (
    customer_id INT,
    transaction_date DATE,
    amount DECIMAL(10, 2)
);
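WITH monthly_transactions AS (
    SELECT
        customer_id,
        -- Reconstructed: first day of the transaction's month, kept as a DATE
        CAST(DATE_FORMAT(transaction_date, '%Y-%m-01') AS DATE) AS transaction_month
    FROM
        customer_transactions
    GROUP BY
        customer_id, CAST(DATE_FORMAT(transaction_date, '%Y-%m-01') AS DATE)
),
consecutive_months AS (
    SELECT
        t1.customer_id,
        t1.transaction_month AS month1,
        t2.transaction_month AS month2
    FROM
        monthly_transactions t1
    JOIN
        monthly_transactions t2
    ON
        t1.customer_id = t2.customer_id
        -- Reconstructed: month2 must be exactly one month after month1
        AND t2.transaction_month = DATE_ADD(t1.transaction_month, INTERVAL 1 MONTH)
)
SELECT DISTINCT
    customer_id
FROM
    consecutive_months;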
Explanation:
1. monthly_transactions CTE:
2. consecutive_months CTE:
3. Final Query:
Sample Output:
Customer_ID
Notes:
• Customer 1: Purchased in January and February.
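The consecutive-month check can also be sketched in plain Python, which makes the year-boundary case explicit. The transactions below are hypothetical; customer 1 mirrors the January and February note above, and the helper month_index is an illustrative name.

```python
from datetime import date

# Hypothetical rows: (customer_id, transaction_date)
transactions = [
    (1, date(2024, 1, 10)),
    (1, date(2024, 2, 3)),
    (2, date(2024, 1, 15)),
    (2, date(2024, 3, 1)),
    (3, date(2023, 12, 28)),
    (3, date(2024, 1, 2)),
]

def month_index(d):
    """Map a date to a running month count so December and January differ by 1."""
    return d.year * 12 + d.month

# Distinct purchase months per customer.
months_by_customer = {}
for customer_id, tx_date in transactions:
    months_by_customer.setdefault(customer_id, set()).add(month_index(tx_date))

# A customer qualifies if any purchase month is followed by the next month.
consecutive = sorted(
    customer_id
    for customer_id, months in months_by_customer.items()
    if any(m + 1 in months for m in months)
)
print(consecutive)
```

Customer 2 is excluded because January and March are not adjacent, while customer 3 qualifies across the year boundary.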
The retention rate is the percentage of users who return in a subsequent month after their
initial activity.
Assumptions
CREATE TABLE user_activity (
    user_id INT,
    activity_date DATE
);
INSERT INTO user_activity (user_id, activity_date) VALUES
(1, '2024-01-15'),
(1, '2024-02-10'),
(1, '2024-03-20'),
(2, '2024-01-20'),
(2, '2024-02-15'),
(3, '2024-02-05'),
(3, '2024-03-10'),
(4, '2024-01-25'),
(5, '2024-02-18'),
(5, '2024-03-15'),
(6, '2024-03-01');
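WITH first_month_activity AS (
    SELECT
        user_id,
        -- Reconstructed: the month of each user's earliest activity
        DATE_FORMAT(MIN(activity_date), '%Y-%m') AS first_active_month
    FROM
        user_activity
    GROUP BY
        user_id
),
monthly_retention AS (
    SELECT
        fma.first_active_month,
        -- Reconstructed: activity month and distinct users active in it
        DATE_FORMAT(ua.activity_date, '%Y-%m') AS active_month,
        COUNT(DISTINCT ua.user_id) AS retained_users
    FROM
        user_activity ua
    JOIN
        first_month_activity fma
    ON
        ua.user_id = fma.user_id
    GROUP BY
        fma.first_active_month, DATE_FORMAT(ua.activity_date, '%Y-%m')
),
monthly_cohort AS (
    SELECT
        first_active_month,
        COUNT(user_id) AS cohort_size
    FROM
        first_month_activity
    GROUP BY
        first_active_month
)
SELECT
    mr.first_active_month,
    mr.active_month,
    mr.retained_users,
    mc.cohort_size,
    -- Reconstructed: the alias retention_rate is assumed
    ROUND(mr.retained_users * 100.0 / mc.cohort_size, 2) AS retention_rate
FROM
    monthly_retention mr
JOIN
    monthly_cohort mc
ON
    mr.first_active_month = mc.first_active_month
ORDER BY
    mr.first_active_month, mr.active_month;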
Explanation
1. first_month_activity CTE:
2. monthly_retention CTE:
o Counts the number of users retained for each combination of their first
active month and subsequent activity months.
3. monthly_cohort CTE:
o Calculates the size of the cohort for each first active month (the total number
of users who first became active in that month).
4. Final Query:
o Orders the results by the first active month and the active month.
Sample Output
Interpretation
• First Active Month: The cohort of users who became active in that month.
• Retention Rate: The percentage of the cohort that returned in subsequent months.
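The cohort logic can be traced by hand on the sample rows listed above. The following Python sketch mirrors the three steps (first active month, cohort size, retained users); the variable names are illustrative. For the January 2024 cohort of three users it gives 100% in January, 66.67% in February, and 33.33% in March.

```python
from collections import defaultdict

# Sample rows listed above: (user_id, activity_date)
activity = [
    (1, "2024-01-15"), (1, "2024-02-10"), (1, "2024-03-20"),
    (2, "2024-01-20"), (2, "2024-02-15"),
    (3, "2024-02-05"), (3, "2024-03-10"),
    (4, "2024-01-25"),
    (5, "2024-02-18"), (5, "2024-03-15"),
    (6, "2024-03-01"),
]

# First active month per user ('YYYY-MM' strings sort chronologically).
first_month = {}
for user_id, day in activity:
    month = day[:7]
    first_month[user_id] = min(first_month.get(user_id, month), month)

# Cohort size: number of users per first active month.
cohort_size = defaultdict(int)
for month in first_month.values():
    cohort_size[month] += 1

# Distinct active users per (cohort month, activity month).
active_users = defaultdict(set)
for user_id, day in activity:
    active_users[(first_month[user_id], day[:7])].add(user_id)

retention = {
    key: round(len(users) * 100 / cohort_size[key[0]], 2)
    for key, users in sorted(active_users.items())
}
for (cohort, month), rate in retention.items():
    print(cohort, month, f"{rate}%")
```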
5. Find the nth highest salary from an employee table, where n is a parameter passed
dynamically to the query.
To find the nth highest salary dynamically, we can use a subquery with the LIMIT clause.
The query involves ranking salaries in descending order, skipping the first n − 1 salaries, and
then retrieving the nth salary. Here's how:
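CREATE TABLE employees (
    employee_name VARCHAR(50),
    salary DECIMAL(10, 2)
);
-- Note: MySQL accepts only literal integers in LIMIT, except as ? placeholders in a prepared
-- statement or as variables inside a stored routine; substitute the value of n - 1 accordingly.
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT @n - 1, 1;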
Explanation
1. @n Variable:
2. DISTINCT salary:
4. LIMIT @n - 1, 1:
o Skips the top n − 1 salaries and retrieves the next one.
If the database supports window functions, you can use the DENSE_RANK() function:
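WITH ranked_salaries AS (
    SELECT
        salary,
        -- Reconstructed: the alias salary_rank is assumed (RANK is a reserved word in MySQL 8)
        DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM
        employees
)
SELECT DISTINCT salary
FROM ranked_salaries
WHERE salary_rank = @n;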
1. DENSE_RANK():
2. WITH ranked_salaries:
Sample Output
Salary
75000.00
Key Notes
• Use the DISTINCT keyword to handle duplicate salaries for the LIMIT method.
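The effect of DISTINCT (or dense ranking) on ties is easy to see in a small Python sketch. The function name and the salary list are hypothetical; the logic sorts the distinct salaries in descending order and picks the nth.

```python
def nth_highest_salary(salaries, n):
    """Return the nth highest distinct salary, or None if there are fewer than n."""
    distinct = sorted(set(salaries), reverse=True)  # duplicates collapse, as with DENSE_RANK
    return distinct[n - 1] if 1 <= n <= len(distinct) else None

# Hypothetical salaries with a tie at the top.
salaries = [90000.00, 90000.00, 75000.00, 60000.00]

second = nth_highest_salary(salaries, 2)
missing = nth_highest_salary(salaries, 5)
print(second)   # 75000.0
print(missing)  # None
```

Without the set() step, the tie at 90000.00 would occupy the first two positions and the second-highest lookup would wrongly return 90000.00.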
6. Explain how indexing works in SQL and how to decide which columns should be indexed
for optimal performance.
An index is a database structure that improves the speed of data retrieval operations on a
table. It works like an optimized lookup table for the database, allowing it to quickly locate
rows without scanning the entire table.
• Structure: Most indexes are implemented as balanced tree structures (e.g., B-trees)
or hash tables. These structures allow efficient searching, insertion, and deletion
operations.
Types of Indexes
1. Primary Index:
o Ensures unique values and quick lookups for primary key operations.
2. Unique Index:
3. Clustered Index:
o Reorders the physical storage of table data to match the index order.
4. Non-clustered Index:
o Creates a separate structure to store the index and points to the table rows.
6. Full-Text Index:
o Optimized for searching text data, such as finding words or phrases in large
text fields.
Benefits of Indexing
• Faster Query Execution: Speeds up SELECT, JOIN, and WHERE clause operations.
• Reduced I/O Operations: Fewer rows are read from the disk.
Drawbacks of Indexing
• Slower Write Operations: INSERT, UPDATE, and DELETE operations become slower
because the index must also be updated.
o Index columns that appear frequently in WHERE, JOIN, ON, ORDER BY, or
GROUP BY clauses.
3. Foreign Keys:
4. High-Selectivity Columns:
o Choose columns with a wide range of unique values (e.g., a user_id column)
because indexes work best with high selectivity.
5. Composite Indexes:
o Use composite indexes when multiple columns are often queried together.
For example, when queries often filter on both year and region, a composite index on
(year, region) will usually perform better than separate indexes on each column.
o Avoid indexing columns with few distinct values (e.g., gender or status with
values like 'Active' or 'Inactive').
7. Read-Heavy Tables:
o Index columns in tables where SELECT operations are more frequent than
INSERT/UPDATE/DELETE.
Examples
1. EXPLAIN Plan:
o Use EXPLAIN to analyze how the database executes a query and whether it
uses an index.
3. Index Maintenance:
Summary
• Analyze query patterns and use tools like EXPLAIN to make data-driven decisions
about indexing.
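To see a plan change when an index is added, the sketch below uses Python's built-in sqlite3 module as a stand-in database. The sales table, its columns, and the index name are assumptions for illustration, and SQLite's EXPLAIN QUERY PLAN differs in wording from EXPLAIN in MySQL or PostgreSQL, so treat the printed text as indicative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, year INTEGER, region TEXT, revenue REAL)"
)
# Hypothetical data: 1000 rows spread over 5 years and 10 regions.
conn.executemany(
    "INSERT INTO sales (year, region, revenue) VALUES (?, ?, ?)",
    [(2020 + i % 5, f"region_{i % 10}", float(i)) for i in range(1000)],
)

query = "SELECT revenue FROM sales WHERE year = 2024 AND region = 'region_4'"

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(query)  # no usable index yet, so a full table scan is expected
conn.execute("CREATE INDEX idx_sales_year_region ON sales (year, region)")
plan_after = plan(query)   # the composite index should now appear in the plan

print("Before:", plan_before)
print("After: ", plan_after)
```

Running the same comparison on a real workload, with the production database's own EXPLAIN, is what justifies keeping or dropping an index.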
7. Describe the differences between LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN and
when to use each one in a complex query.
Differences Between LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN
In SQL, JOIN operations combine rows from two or more tables based on a related column.
The differences among LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN lie in how
unmatched rows are handled.
1. LEFT JOIN
• Definition: Returns all rows from the left table and the matched rows from the right
table. If no match is found, the result contains NULL for columns from the right
table.
• Use Case: Use when you want all records from the left table regardless of whether
there is a match in the right table.
Syntax
SELECT columns
FROM table1
LEFT JOIN table2
ON table1.common_column = table2.common_column;
Example
• Tables:
o Customers:
CustomerID Name
1 Alice
2 Bob
3 Charlie
o Orders:
OrderID CustomerID
101 1
102 2
• Query:
SELECT c.Name, o.OrderID
FROM Customers c
LEFT JOIN Orders o
ON c.CustomerID = o.CustomerID;
• Result:
Name OrderID
Alice 101
Bob 102
Charlie NULL
2. RIGHT JOIN
• Definition: Returns all rows from the right table and the matched rows from the left
table. If no match is found, the result contains NULL for columns from the left table.
• Use Case: Use when you want all records from the right table regardless of whether
there is a match in the left table.
Syntax
SELECT columns
FROM table1
RIGHT JOIN table2
ON table1.common_column = table2.common_column;
Example
• Query:
SELECT c.Name, o.OrderID
FROM Customers c
RIGHT JOIN Orders o
ON c.CustomerID = o.CustomerID;
• Result:
Name OrderID
Alice 101
Bob 102
3. FULL OUTER JOIN
• Definition: Combines the results of LEFT JOIN and RIGHT JOIN. Returns all rows
from both tables, with NULL in columns where no match exists.
• Use Case: Use when you want to include all records from both tables, showing
unmatched rows with NULL values.
Syntax
SELECT columns
FROM table1
FULL OUTER JOIN table2
ON table1.common_column = table2.common_column;
Example
• Query:
SELECT c.Name, o.OrderID
FROM Customers c
FULL OUTER JOIN Orders o
ON c.CustomerID = o.CustomerID;
• Result:
Name OrderID
Alice 101
Bob 102
Charlie NULL
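The three results above can be reproduced with a small Python emulation of each join over the Customers and Orders sample tables. This is a sketch of the semantics only, not how a database executes joins; the function names are illustrative, and None stands in for NULL.

```python
# Sample tables from the example above.
customers = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]  # (CustomerID, Name)
orders = [(101, 1), (102, 2)]                           # (OrderID, CustomerID)

def left_join(customers, orders):
    """All customers; OrderID is None when a customer has no order."""
    rows = []
    for customer_id, name in customers:
        matches = [order_id for order_id, cid in orders if cid == customer_id]
        rows.extend((name, order_id) for order_id in matches or [None])
    return rows

def right_join(customers, orders):
    """All orders; Name is None when an order has no matching customer."""
    names = {customer_id: name for customer_id, name in customers}
    return [(names.get(cid), order_id) for order_id, cid in orders]

def full_outer_join(customers, orders):
    """Left join plus any orders whose customer is missing."""
    known = {customer_id for customer_id, _ in customers}
    unmatched = [(None, order_id) for order_id, cid in orders if cid not in known]
    return left_join(customers, orders) + unmatched

print(left_join(customers, orders))
print(right_join(customers, orders))
print(full_outer_join(customers, orders))
```

With this data the full outer join matches the left join because every order has a customer; adding an order with an unknown CustomerID would produce a row with None for Name.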
1. LEFT JOIN:
o When the left table contains a primary set of data and you want to include all
rows, even if they have no matching data in the right table.
o Example: Listing all customers, including those who haven't made any
orders.
2. RIGHT JOIN:
o When the right table contains a primary set of data and you want to include
all rows, even if they have no matching data in the left table.
3. FULL OUTER JOIN:
o When both tables are equally important, and you want to analyze all data
points, even unmatched rows.
Rows from Left Table: always included (LEFT JOIN), only if matched (RIGHT JOIN), always included (FULL OUTER JOIN)
Visual Representation
If A represents rows from the left table and B represents rows from the right table:
• Use LEFT JOIN or RIGHT JOIN instead of FULL OUTER JOIN if you only need one
side's unmatched rows, as it reduces computation.
• Always use indexes on the columns used in the ON clause to improve performance
in joins.
8. What is the difference between HAVING and WHERE clauses in SQL, and when would
you use each?
1. WHERE Clause
• Use Case: Used to filter rows based on conditions applied to individual columns.
Syntax:
SELECT column1, column2
FROM table_name
WHERE condition;
Example:
SELECT * FROM orders
WHERE total_orders > 50;
• Explanation: Filters rows before any aggregation, keeping only orders with total_orders
> 50.
2. HAVING Clause
Syntax:
SELECT column1, AGGREGATE_FUNCTION(column2)
FROM table_name
GROUP BY column1
HAVING condition;
Example:
SELECT category, COUNT(*) AS total_orders
FROM orders
GROUP BY category
HAVING COUNT(*) > 10;
• Explanation: Filters the aggregated results (only categories with total_orders > 10
are included).
Key Differences
Scope: WHERE is applied to individual table rows; HAVING is applied to the grouped results.
o Example: Filtering customer records where the order count is more than 50.
o Example: Counting orders by category and filtering categories with more than
10 orders.
Practical Scenario
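SELECT category, COUNT(*) AS total_orders
FROM orders WHERE order_date > '2024-01-01'  -- order_date is an assumed column name
GROUP BY category
HAVING COUNT(*) > 10;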
This will show categories with more than 10 orders placed after January 1, 2024.
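The difference in evaluation order can be observed directly with Python's built-in sqlite3 module. The orders table, its order_date column, and the rows are hypothetical: the Games category has 11 orders in total but only 3 after January 1, 2024, so it passes HAVING alone and fails once WHERE is applied first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, order_date TEXT)")

# Hypothetical data: (category, orders before 2024-01-01, orders after it)
rows = []
for category, before, after in [("Books", 0, 12), ("Games", 8, 3), ("Toys", 0, 5)]:
    rows += [(category, "2023-12-15")] * before + [(category, "2024-02-01")] * after
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# HAVING alone: every row is counted, then groups are filtered.
having_only = conn.execute(
    "SELECT category, COUNT(*) FROM orders "
    "GROUP BY category HAVING COUNT(*) > 10 ORDER BY category"
).fetchall()

# WHERE + HAVING: rows are filtered first, then the remaining groups.
where_and_having = conn.execute(
    "SELECT category, COUNT(*) FROM orders "
    "WHERE order_date > '2024-01-01' "
    "GROUP BY category HAVING COUNT(*) > 10 ORDER BY category"
).fetchall()

print(having_only)
print(where_and_having)
```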