SQL Interview Questions Goldman Sachs
INTERVIEW QUESTIONS
--------------------------------------------------------
EXPERIENCE : 0-3 YOE
CTC : 23 LPA
Power BI Solution:
Tableau Solution:
• Best for comparing multiple categories (e.g., Sales performance across regions and
months).
• Typically used in Power BI with Conditional Formatting or in Tableau using Density
Marks.
Scatter Plot:
• Used to show correlation between two numerical variables (e.g., Revenue vs.
Advertising Spend).
Power BI Approach:
2. Use Card Visuals to display key KPIs (e.g., Total Sales, Profit Margin).
5. Apply Slicers & Filters to allow users to interact with the data.
Tableau Approach:
[Profit] / [Sales]
Formula:
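If $X_1, X_2, \ldots, X_n$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$, then for large $n$ the sample mean
$\bar{X} = (X_1 + X_2 + \cdots + X_n)/n$
is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$.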
Importance of CLT:
• Enables the use of normal distribution for hypothesis testing and confidence
intervals.
• Explains why sample means are approximately normally distributed for large samples,
even when the population is not.
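A small simulation can make the CLT concrete. The sketch below is illustrative only: the exponential population, the sample size of 50, and the 5000 repetitions are assumptions chosen for the example, not values from the original notes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

n = 50              # size of each sample (assumed)
num_samples = 5000  # number of sample means collected (assumed)

# Exponential population with rate 1: mean 1, variance 1, strongly right-skewed
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)
theoretical_std = 1 / n ** 0.5  # sigma / sqrt(n)

print("Mean of sample means:", round(mean_of_means, 3))
print("Std of sample means: ", round(std_of_means, 3))
print("Theoretical std:     ", round(theoretical_std, 3))
```

Although the population is skewed, the sample means cluster around the population mean with a spread close to sigma / sqrt(n), which is what the theorem predicts.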
Example:
3. Shapiro-Wilk Test:
• Hypothesis:
stat, p = shapiro(data)
4. Anderson-Darling Test:
• Another test to check normality, similar in purpose to Shapiro-Wilk; it is particularly
sensitive to deviations in the tails and is often used for larger datasets.
• Kurtosis measures tail thickness (close to 3 for a normal distribution). Note that
scipy.stats.kurtosis returns excess kurtosis by default, which is close to 0 for normal data.
print("Skewness:", skew(data))
print("Kurtosis:", kurtosis(data))
Conclusion:
• If most tests indicate normality, we assume the data follows a normal distribution.
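To make the skewness and kurtosis checks concrete, here is a minimal sketch that computes both directly from standardized moments with NumPy. The synthetic normal sample and the helper names `sample_skewness` and `sample_kurtosis` are assumptions for illustration; in practice the `skew` and `kurtosis` functions from `scipy.stats` serve the same purpose.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)  # synthetic sample (assumed)

def sample_skewness(x):
    """Third standardized moment; close to 0 for symmetric data."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def sample_kurtosis(x):
    """Fourth standardized moment; close to 3 for normal data."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4))

skewness = sample_skewness(data)
kurt = sample_kurtosis(data)
excess_kurt = kurt - 3.0  # the "excess" convention, close to 0 for normal data

print("Skewness:", round(skewness, 3))
print("Kurtosis:", round(kurt, 3))
print("Excess kurtosis:", round(excess_kurt, 3))
```

Which convention a library reports (3 or 0 for normal data) matters when comparing results against a rule of thumb.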
Scale: correlation is independent of units; covariance depends on units.
Dependence
Example:
Consider two stocks, Stock A and Stock B:
import numpy as np
# Covariance
cov_matrix = np.cov(A, B)
# Correlation
corr_matrix = np.corrcoef(A, B)
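The difference in scale can be checked numerically. In the sketch below the return values for `A` and `B` are hypothetical numbers invented for the example; only `np.cov` and `np.corrcoef` come from the snippet above.

```python
import numpy as np

# Hypothetical daily returns (in percent) for two stocks
A = np.array([1.2, 0.8, -0.5, 1.5, 0.3])
B = np.array([0.9, 0.7, -0.2, 1.1, 0.4])

cov_ab = np.cov(A, B)[0, 1]        # off-diagonal entry: covariance of A and B
corr_ab = np.corrcoef(A, B)[0, 1]  # off-diagonal entry: correlation of A and B

# Change units (percent -> basis points) and recompute
cov_scaled = np.cov(A * 100, B * 100)[0, 1]
corr_scaled = np.corrcoef(A * 100, B * 100)[0, 1]

print("Covariance:", cov_ab, "-> after rescaling:", cov_scaled)
print("Correlation:", corr_ab, "-> after rescaling:", corr_scaled)
```

Rescaling both series by 100 multiplies the covariance by 100 * 100, while the correlation stays the same and always lies between -1 and 1.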
Key Takeaway:
Type I Error (False Positive): Rejecting a true null hypothesis (H₀). False alarm (e.g., convicting an innocent person).
Type II Error (False Negative): Failing to reject a false null hypothesis (H₀). Missing a real effect (e.g., letting a guilty person go free).
Example:
Consider a COVID-19 Test:
• Type I Error: Healthy person is diagnosed as COVID-positive (false positive).
Formula:
Reducing Errors:
2. Collect Data:
• Flip the coin 100 times and record number of heads (X).
print("P-value:", p_value)
4. Interpret Results:
• If p-value < 0.05, reject H₀ (coin is biased).
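The coin test can be sketched end to end with an exact binomial test written from the definition. The observed count of 60 heads and the helper `binomial_two_sided_p` are assumptions for illustration; a library routine such as `scipy.stats.binomtest` would normally be used instead.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided p-value: total probability of outcomes no more likely than k."""
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))

flips = 100
heads = 60  # hypothetical observed count, not from the original notes

p_value = binomial_two_sided_p(heads, flips)
print("P-value:", round(p_value, 4))

if p_value < 0.05:
    print("Reject H0: the coin appears biased.")
else:
    print("Fail to reject H0: no evidence of bias at the 5% level.")
```

With these assumed numbers the p-value sits just above 0.05, so the fair-coin hypothesis is not rejected even though 60 heads may look suspicious.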
Conclusion:
Efficient Methods:
• This reads the file in smaller chunks instead of loading everything into memory at
once.
import pandas as pd
• If you don’t need all columns, load only the required ones.
import dask.dataframe as dd
df = dd.read_csv("large_file.csv")
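The chunking and column-selection ideas can be combined in pandas. The sketch below writes a small temporary CSV as a stand-in for a large file; the column names `id`, `amount`, and `note` and the chunk size of 250 are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Build a small CSV so the sketch is self-contained (stand-in for a large file)
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "large_file.csv")
pd.DataFrame({
    "id": range(1, 1001),
    "amount": [i % 10 for i in range(1, 1001)],
    "note": ["x"] * 1000,
}).to_csv(csv_path, index=False)

total = 0
rows_seen = 0

# chunksize makes read_csv return an iterator of DataFrames;
# usecols skips the columns that are not needed
for chunk in pd.read_csv(csv_path, usecols=["id", "amount"], chunksize=250):
    total += int(chunk["amount"].sum())
    rows_seen += len(chunk)

print("Rows processed:", rows_seen)
print("Total amount:", total)
```

Only one chunk is held in memory at a time, so the aggregate can be computed even when the full file would not fit.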
Best Practice:
df["new_column"] = df["existing_column"].apply(lambda x: x * 2)
• Uses NumPy operations, which are much faster than loops or apply().
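For comparison, a minimal sketch of the same doubling done both ways; the column contents are hypothetical and the result is identical in both cases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"existing_column": np.arange(100_000)})  # hypothetical data

# Row by row: the Python lambda is called once per element
via_apply = df["existing_column"].apply(lambda x: x * 2)

# Vectorized: a single NumPy-backed operation over the whole column
df["new_column"] = df["existing_column"] * 2

print((via_apply == df["new_column"]).all())
```

The vectorized form avoids per-element Python function calls, which is where the speed difference comes from.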
Performance Comparison:
Method Speed Works on
Conclusion:
3. Given a dataset, how would you detect and handle missing values?
Step 1: Detect Missing Values
import pandas as pd
df = pd.read_csv("data.csv")
# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
df["department"] = df["department"].fillna(df["department"].mode()[0])
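A compact sketch of the detect-then-handle workflow follows. The small DataFrame and the choice of median for the numeric column are assumptions for illustration; the mode fill for `department` mirrors the line above.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value per column
df = pd.DataFrame({
    "salary": [50000.0, np.nan, 62000.0, 58000.0],
    "department": ["HR", "IT", None, "IT"],
})

# Detect: count missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)

# Handle: median for the numeric column, most frequent value for the categorical one
df["salary"] = df["salary"].fillna(df["salary"].median())
df["department"] = df["department"].fillna(df["department"].mode()[0])

print(df)
```

Median and mode are common defaults because they are less affected by extreme values than the mean, but the right strategy depends on why the data is missing.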
Best Practice:
Types of Joins:
Example:
import pandas as pd
df1 = pd.DataFrame({
})
df2 = pd.DataFrame({
"ID": [1, 2, 4],
})
print(df_merged)
Best Practice:
• Use how="left" if you don’t want to lose data from the main DataFrame.
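The effect of the `how` argument can be seen on a toy example. The `ID` values for `df2` follow the snippet above; the contents of `df1` and the `Name` and `Salary` columns are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Carol"]})  # assumed values
df2 = pd.DataFrame({"ID": [1, 2, 4], "Salary": [5000, 6000, 7000]})       # Salary is assumed

# Inner join: only IDs present in both DataFrames (1 and 2)
inner = pd.merge(df1, df2, on="ID", how="inner")

# Left join: every row of df1 is kept; ID 3 has no match, so Salary is NaN
left = pd.merge(df1, df2, on="ID", how="left")

print(inner)
print(left)
```

The inner join silently drops ID 3, while the left join keeps it with a missing salary.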
Python Code:
import numpy as np
import pandas as pd
# Sample dataset
data = {"Salary": [40000, 45000, 60000, 70000, 80000, 120000, 150000, 500000]}
df = pd.DataFrame(data)
Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
IQR = Q3 - Q1
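# Compute bounds (conventional 1.5 * IQR rule)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identify outliers
outliers = df[(df["Salary"] < lower_bound) | (df["Salary"] > upper_bound)]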
print("Outliers:\n", outliers)
Best Practice:
SQL
1. Write an SQL query to find the second-highest salary from an
employee table.
Solution:
To find the second-highest salary, we can use various methods such as using LIMIT,
subqueries, or window functions.
Approach 1: Using LIMIT and OFFSET (For MySQL and PostgreSQL)
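SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
• This query orders the distinct salaries in descending order and skips the highest salary
using OFFSET 1, returning the second-highest salary. DISTINCT keeps a tied top salary from
being returned as the second-highest.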
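Approach 2: Using a Subquery
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);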
• This query first selects the highest salary using a subquery and then finds the
maximum salary that is smaller than the highest salary, which is the second-highest
salary.
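Both approaches can be tried against an in-memory SQLite database from Python. This is a sketch under assumptions: the table layout and salary values are invented, and SQLite stands in for MySQL or PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees (salary) VALUES (?)",
    [(5000,), (7000,), (7000,), (6000,)],  # hypothetical salaries with a tie at the top
)

# LIMIT / OFFSET over distinct salaries
limit_offset = conn.execute(
    "SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchone()[0]

# Maximum salary strictly below the overall maximum
subquery = conn.execute(
    "SELECT MAX(salary) FROM employees "
    "WHERE salary < (SELECT MAX(salary) FROM employees)"
).fetchone()[0]

print(limit_offset, subquery)
conn.close()
```

The tie at 7000 is deliberate: without DISTINCT, the LIMIT/OFFSET form would return 7000 again instead of 6000.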
Using ROW_NUMBER() (This works in SQL Server, PostgreSQL, MySQL 8.0+, etc.)
WITH ranked_rows AS (
FROM table_name
Using GROUP BY
FROM table_name
• This query groups by the columns you want to remove duplicates for and returns
distinct values for each group.
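The two de-duplication styles can be compared on a small table. Everything about the `contacts` table below (its name, columns, and rows) is hypothetical, and the window-function query assumes SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO contacts (name, email) VALUES (?, ?)",
    [
        ("Alice", "alice@example.com"),
        ("Bob", "bob@example.com"),
        ("Alice", "alice@example.com"),  # duplicate row
    ],
)

# ROW_NUMBER(): number the rows inside each duplicate group, keep the first
row_number_query = """
WITH ranked_rows AS (
    SELECT name, email,
           ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS row_num
    FROM contacts
)
SELECT name, email FROM ranked_rows WHERE row_num = 1 ORDER BY name
"""

# GROUP BY: collapse each duplicate group into a single row
group_by_query = "SELECT name, email FROM contacts GROUP BY name, email ORDER BY name"

via_row_number = conn.execute(row_number_query).fetchall()
via_group_by = conn.execute(group_by_query).fetchall()
print(via_row_number)
print(via_group_by)
conn.close()
```

The ROW_NUMBER() form is the more flexible of the two, since the ORDER BY inside the window decides which copy in each group is kept.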
A running total (or cumulative sum) can be calculated using a window function like SUM()
with an OVER() clause.
SELECT month, sales,
       SUM(sales) OVER (ORDER BY month) AS running_total
FROM sales_table;
• This query calculates the cumulative sum of sales month by month using
SUM(sales) and orders the rows by month. The OVER() clause tells SQL to apply the
SUM() function across the ordered rows to get the running total.
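A running total can be verified quickly in SQLite. The month labels and sales figures below are made up for the sketch, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_table (month TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO sales_table VALUES (?, ?)",
    [("2024-01", 100), ("2024-02", 150), ("2024-03", 200)],  # hypothetical data
)

running = conn.execute(
    """
    SELECT month, sales,
           SUM(sales) OVER (ORDER BY month) AS running_total
    FROM sales_table
    ORDER BY month
    """
).fetchall()

for row in running:
    print(row)
conn.close()
```

Each row's running total is the sum of its own sales and all earlier months, because ORDER BY inside OVER() limits the window to rows up to the current one.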
Example:
• INNER JOIN
o Returns only the rows that have a match in both tables.
SELECT *
FROM table1
INNER JOIN table2 ON table1.id = table2.id;
• LEFT JOIN
o Returns all rows from the left table and matching rows from the right table.
o If there is no match in the right table, the result will include NULL for columns
from the right table.
SELECT *
FROM table1
LEFT JOIN table2 ON table1.id = table2.id;
• RIGHT JOIN
o Returns all rows from the right table and matching rows from the left table.
o If there is no match in the left table, the result will include NULL for columns
from the left table.
SELECT *
FROM table1
RIGHT JOIN table2 ON table1.id = table2.id;
• FULL JOIN
o Returns all rows from both tables, with NULL for non-matching rows.
SELECT *
FROM table1
FULL JOIN table2 ON table1.id = table2.id;
Example:
ID Name Department
1 Alice HR
2 Bob IT
For a LEFT JOIN with a table that contains a department and salary column, it would return:
ID Name Department Salary
1 Alice HR 5000
2 Bob IT NULL
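The LEFT JOIN example can be reproduced in SQLite. The table names `employees` and `salaries` and the join on `department` are assumptions, since the notes do not name the second table or the join key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, department TEXT)")
conn.execute("CREATE TABLE salaries (department TEXT, salary INTEGER)")  # assumed table
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, "Alice", "HR"), (2, "Bob", "IT")])
conn.execute("INSERT INTO salaries VALUES ('HR', 5000)")  # no row for IT

rows = conn.execute(
    """
    SELECT e.id, e.name, e.department, s.salary
    FROM employees AS e
    LEFT JOIN salaries AS s ON e.department = s.department
    ORDER BY e.id
    """
).fetchall()

for row in rows:
    print(row)  # SQL NULL is returned to Python as None
conn.close()
```

Bob is kept even though IT has no salary row; the missing value comes back as NULL.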
To find the top 3 customers with the highest spending, we can use GROUP BY to sum the
spending for each customer and then order the result by the total spending in descending
order.
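-- "amount" is an assumed name for the spending column
SELECT customer_id, SUM(amount) AS total_spending
FROM transactions
GROUP BY customer_id
ORDER BY total_spending DESC
LIMIT 3;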
• This query groups the transactions by customer_id, calculates the total spending for
each customer, orders the results by total_spending in descending order, and limits
the output to the top 3 customers.
Example:
Customer_ID Total_Spending
101 5000
102 4500
103 4000
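The top-3 query can be exercised in SQLite with a few transactions whose totals match the example table. The `amount` column name and the individual transaction values are assumptions for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [
        (101, 3000), (101, 2000),  # totals 5000
        (102, 4500),               # totals 4500
        (103, 1500), (103, 2500),  # totals 4000
        (104, 1000),               # falls outside the top 3
    ],
)

top_customers = conn.execute(
    """
    SELECT customer_id, SUM(amount) AS total_spending
    FROM transactions
    GROUP BY customer_id
    ORDER BY total_spending DESC
    LIMIT 3
    """
).fetchall()

for row in top_customers:
    print(row)
conn.close()
```

Without the ORDER BY clause, LIMIT 3 would return three arbitrary customers rather than the highest spenders.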
JOIN Type     Description
INNER JOIN    Only rows that match in both tables are returned.
LEFT JOIN     All rows from the left table and matching rows from the right table; non-matching rows have NULL values from the right table.
RIGHT JOIN    All rows from the right table and matching rows from the left table; non-matching rows have NULL values from the left table.
FULL JOIN     All rows from both tables; non-matching rows have NULL values in the columns where there is no match.