SQL Interview Questions for Data Analysts
Part I
linkedin.com/in/ileonjose
1. Median Number of Searches
Problem Statement: For an IPL ad
campaign, you need to determine the
median number of searches made by fans
last year. The data is stored in a summary
table with columns searches (indicating
the number of searches) and num_users
(indicating the number of users who
performed that many searches). Because
the table stores one row per search count
rather than one row per user, the median
cannot be read directly off the searches
column; each value must be weighted by
num_users.
Table: search_frequency (searches, num_users)
How to Solve:
1. Expand Data:
Create a detailed list where each
search count is repeated according to
the number of users. For example, if
10 users made 5 searches each, the
list should include 10 entries of 5
searches.
2. Calculate Median:
Use the expanded list to find the
median value. The median is the
middle value when all entries are
ordered. If the number of entries is
even, the median is the average of the
two middle values.
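The two steps above can be sketched as follows. This is a minimal illustration, not a reference solution: it runs the SQL through Python's built-in sqlite3 module, the sample rows are made up, and the recursive CTE is just one way to expand rows (other databases offer helpers such as generate_series). It assumes an SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE search_frequency (searches INTEGER, num_users INTEGER);
-- Made-up sample rows: expands to 1,1,2,2,3,3,3,4
INSERT INTO search_frequency VALUES (1, 2), (2, 2), (3, 3), (4, 1);
""")

median_sql = """
WITH RECURSIVE expanded(searches, remaining) AS (
    -- Step 1: emit each search count once per user
    SELECT searches, num_users FROM search_frequency WHERE num_users > 0
    UNION ALL
    SELECT searches, remaining - 1 FROM expanded WHERE remaining > 1
),
ranked AS (
    SELECT searches,
           ROW_NUMBER() OVER (ORDER BY searches) AS rn,
           COUNT(*) OVER () AS total
    FROM expanded
)
-- Step 2: average the middle row (odd total) or the two middle rows (even total)
SELECT ROUND(AVG(searches), 1) AS median
FROM ranked
WHERE rn IN ((total + 1) / 2, (total + 2) / 2);
"""

median = conn.execute(median_sql).fetchone()[0]
print(median)
```

The integer division in the WHERE clause picks the same row twice when the total is odd and the two middle rows when it is even. On a very large table, a running total of num_users would avoid materializing the expanded list.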
2. Sum of Odd and Even Measurements
Problem Statement: For each match day,
calculate two sums of the measurements
taken at cricket matches: one over the
odd-numbered measurements and one over
the even-numbered measurements, where
measurements are numbered in time order.
You have a table measurements with columns
measurement_time and measurement_value.
Table: measurements (measurement_time, measurement_value)
How to Solve:
1. Assign Row Numbers:
Use the ROW_NUMBER() function to
assign a row number to each
measurement, partitioned by match
date (the date part of
measurement_time) and ordered by
measurement_time, so numbering
restarts at 1 each day.
2. Calculate Sums:
Use conditional aggregation to sum
measurements based on whether their
row number is odd or even.
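A minimal sketch of these two steps, run through Python's sqlite3 module. The sample rows are invented, measurement_time is assumed to be stored as an ISO-8601 text timestamp, and DATE() is SQLite's way of extracting the date part; other databases use a cast or a similar function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measurements (measurement_time TEXT, measurement_value REAL);
-- Made-up sample rows covering two match days
INSERT INTO measurements VALUES
  ('2024-04-10 09:00:00', 10.0),
  ('2024-04-10 11:00:00', 20.0),
  ('2024-04-10 13:00:00', 30.0),
  ('2024-04-11 09:30:00', 5.0),
  ('2024-04-11 12:00:00', 7.0);
""")

odd_even_sql = """
WITH numbered AS (
    -- Step 1: number measurements within each day, in time order
    SELECT DATE(measurement_time) AS match_date,
           measurement_value,
           ROW_NUMBER() OVER (
               PARTITION BY DATE(measurement_time)
               ORDER BY measurement_time
           ) AS rn
    FROM measurements
)
-- Step 2: conditional aggregation on the parity of the row number
SELECT match_date,
       SUM(CASE WHEN rn % 2 = 1 THEN measurement_value ELSE 0 END) AS odd_sum,
       SUM(CASE WHEN rn % 2 = 0 THEN measurement_value ELSE 0 END) AS even_sum
FROM numbered
GROUP BY match_date
ORDER BY match_date;
"""

rows = conn.execute(odd_even_sql).fetchall()
for row in rows:
    print(row)
```

The ELSE 0 branch keeps a sum at zero instead of NULL on a day that has only one measurement.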
3. Google Maps - Most Off-Topic UGC
Problem Statement: As a Data Analyst on
the Google Maps User Generated Content
team, you and your Product Manager are
investigating user-generated content
(UGC) – photos and reviews that
independent users upload to Google
Maps.
Identify which venue type (e.g.,
Restaurant, Bar) has the largest number
of "off-topic" UGC items.
You have two tables: place_info
(with place categories) and
maps_ugc_review (with UGC details).
How to Solve:
1. Count Off-Topic UGC:
Join place_info with maps_ugc_review
on place ID. Filter UGC to include only
those tagged as "Off-topic".
Count the occurrences of off-topic
UGC for each venue category.
2. Find Top Venue Category:
Determine which category has the
highest count of off-topic UGC.
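One possible shape for this query, run through Python's sqlite3 module. The source names only the two tables, so the column names (place_id, place_category, content_id, content_tag) and all sample rows here are assumptions for illustration. RANK() is used instead of LIMIT 1 so that tied categories are all returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Column names are hypothetical; only the table names come from the problem
CREATE TABLE place_info (place_id INTEGER, place_category TEXT);
CREATE TABLE maps_ugc_review (content_id INTEGER, place_id INTEGER, content_tag TEXT);
INSERT INTO place_info VALUES
  (1, 'Restaurant'), (2, 'Restaurant'), (3, 'Bar'), (4, 'Bar');
INSERT INTO maps_ugc_review VALUES
  (101, 1, 'Off-topic'), (102, 2, 'Off-topic'),
  (103, 3, 'Off-topic'), (104, 3, 'Spam'), (105, 4, 'Relevant');
""")

top_category_sql = """
WITH off_topic AS (
    -- Step 1: join on place ID, keep off-topic rows, count per category
    SELECT p.place_category, COUNT(*) AS off_topic_count
    FROM place_info AS p
    JOIN maps_ugc_review AS r ON r.place_id = p.place_id
    WHERE r.content_tag = 'Off-topic'
    GROUP BY p.place_category
),
ranked AS (
    SELECT place_category, off_topic_count,
           RANK() OVER (ORDER BY off_topic_count DESC) AS rnk
    FROM off_topic
)
-- Step 2: keep the category (or tied categories) with the highest count
SELECT place_category, off_topic_count
FROM ranked
WHERE rnk = 1;
"""

top = conn.execute(top_category_sql).fetchall()
print(top)
```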
4. Popular Search Categories
Problem Statement: Find the total
number of searches per category for the
year 2024, and group the results by
month. You have two tables: searches
(with search details) and categories (with
category names).
Tables: categories, searches
How to Solve:
1. Join Tables:
Combine the searches table with the
categories table to include category
names.
2. Count Searches:
Aggregate the number of searches by
category and month for the year 2024.
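A minimal sketch of the join and aggregation, run through Python's sqlite3 module. The column names (search_id, category_id, search_date, category_name) and the sample rows are assumptions, since the source gives only the table names. strftime is SQLite-specific; other databases would use something like EXTRACT or DATE_TRUNC.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Column names are hypothetical; only the table names come from the problem
CREATE TABLE categories (category_id INTEGER, category_name TEXT);
CREATE TABLE searches (search_id INTEGER, category_id INTEGER, search_date TEXT);
INSERT INTO categories VALUES (1, 'Cricket'), (2, 'Tickets');
INSERT INTO searches VALUES
  (1, 1, '2024-01-05'), (2, 1, '2024-01-20'), (3, 2, '2024-01-11'),
  (4, 1, '2024-02-02'), (5, 2, '2023-12-31'), (6, 2, '2024-02-14');
""")

monthly_sql = """
SELECT c.category_name,
       strftime('%Y-%m', s.search_date) AS search_month,
       COUNT(*) AS total_searches
FROM searches AS s
JOIN categories AS c ON c.category_id = s.category_id   -- Step 1: add category names
WHERE s.search_date >= '2024-01-01'
  AND s.search_date <  '2025-01-01'                     -- keep only 2024
GROUP BY c.category_name, search_month                  -- Step 2: count per category and month
ORDER BY search_month, c.category_name;
"""

monthly = conn.execute(monthly_sql).fetchall()
for row in monthly:
    print(row)
```

Filtering with a half-open date range rather than wrapping search_date in a function lets the database use an index on that column, if one exists.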
5. What is Database Denormalization?
Problem Statement: Explain the concept
of denormalization in database design.
Denormalization is a database design
approach where tables are combined to
simplify the schema and improve query
performance.
This process involves introducing
redundancy by merging tables, which
reduces the need for complex joins and
can speed up read operations.
The trade-off is that redundant copies
take more storage and must be kept in
sync on every write, so denormalization
is best suited to read-heavy workloads.
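To make the idea concrete, here is a small sketch using Python's sqlite3 module. The customers and orders tables, their columns, and the rows are all hypothetical. It copies customer_name into a denormalized orders_denorm table and shows that the same report can then be produced without a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized design (hypothetical tables): names live only in customers
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0);

-- Denormalized design: customer_name is copied onto every order row
CREATE TABLE orders_denorm AS
SELECT o.order_id, o.customer_id, c.customer_name, o.amount
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id;
""")

# Normalized read: needs a join to get the name
normalized = conn.execute("""
    SELECT c.customer_name, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.customer_name
    ORDER BY c.customer_name
""").fetchall()

# Denormalized read: single table, no join
denormalized = conn.execute("""
    SELECT customer_name, SUM(amount)
    FROM orders_denorm
    GROUP BY customer_name
    ORDER BY customer_name
""").fetchall()

print(normalized)
print(denormalized)
```

The cost shows up on writes: renaming a customer now means updating every matching row in orders_denorm as well as the row in customers.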