Python - Pandas_Numpy Interview Q&A

The document outlines common interview questions and tasks for Data Analyst roles focusing on Python, Pandas, and Numpy. It includes coding tasks involving sales data, employee data, product sales data, and customer transaction data, along with verbal questions about data manipulation techniques. The document serves as a comprehensive guide for preparing for technical interviews in data analysis.

Uploaded by Komal Chaudhari

Python - Pandas/Numpy Interview Q&A

Note: For Data Analyst role interviews, you will typically face two different kinds
of questions:

1. You will be asked to solve typical SQL interview questions using
   Pandas/Numpy. I have listed the top 20 coding and 25 verbal questions
   below for this.
2. Additionally, you will be asked basic to medium Python problem-solving
   questions. For these, you can refer to the following link:
   https://www.analyticsvidhya.com/articles/python-coding-interview-questions/

---

DataFrame 1: Sales Data

Sample DataFrame:

Sales_ID | Salesperson_ID | Date       | Sales_Amount
---------|----------------|------------|--------------
1        | 101            | 2023-01-10 | 1000
2        | 102            | 2023-02-15 | 1500
3        | 101            | 2023-02-18 | 2000
4        | 103            | 2023-01-20 | 2500
5        | 101            | 2023-03-01 | 1800
6        | 102            | 2023-03-11 | 2200
7        | 103            | 2023-03-14 | 2700

---

1. Calculate Month-over-Month (MoM) Sales Percentage Change


Task: Calculate the month-over-month percentage change in sales amounts for
the entire DataFrame.
Solution:
Convert the Date column to a monthly period, group by month, sum the sales
amounts, and compute the percentage change.

df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.to_period('M')
monthly_sales = df.groupby('Month')['Sales_Amount'].sum()
mom_change = monthly_sales.pct_change() * 100
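As a quick sanity check, the snippet can be run end-to-end; the DataFrame construction below simply rebuilds the sample table above:

```python
import pandas as pd

# Rebuild the sample Sales Data table
df = pd.DataFrame({
    'Sales_ID': [1, 2, 3, 4, 5, 6, 7],
    'Salesperson_ID': [101, 102, 101, 103, 101, 102, 103],
    'Date': ['2023-01-10', '2023-02-15', '2023-02-18', '2023-01-20',
             '2023-03-01', '2023-03-11', '2023-03-14'],
    'Sales_Amount': [1000, 1500, 2000, 2500, 1800, 2200, 2700],
})

df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.to_period('M')
monthly_sales = df.groupby('Month')['Sales_Amount'].sum()  # Jan 3500, Feb 3500, Mar 6700
mom_change = monthly_sales.pct_change() * 100              # NaN, 0.0, ~91.43
```

Note that the first month's change is always NaN, since there is no prior month to compare against.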

---

2. Cumulative Sales by Salesperson


Task: Calculate the cumulative sales for each Salesperson_ID over time.
Solution:
Group the data by Salesperson_ID and apply the cumulative sum function on
Sales_Amount.

df['Cumulative_Sales'] = df.groupby('Salesperson_ID')['Sales_Amount'].cumsum()

---

3. Categorize Sales into High, Medium, and Low Ranges


Task: Create a new column that categorizes sales into High, Medium, and Low
based on thresholds (Low < 1500, 1500 ≤ Medium < 2500, High ≥ 2500).
Solution:
Use pd.cut() to categorize the sales into defined bins.

bins = [0, 1500, 2500, float('inf')]
labels = ['Low', 'Medium', 'High']
# right=False makes the bins left-inclusive, so 1500 -> Medium and 2500 -> High,
# matching the thresholds in the task
df['Sales_Category'] = pd.cut(df['Sales_Amount'], bins=bins, labels=labels, right=False)

---

4. Find the Top 3 Salespersons by Total Sales


Task: Identify the top 3 salespersons based on their total sales.
Solution:
Group the data by Salesperson_ID and sum the Sales_Amount. Use nlargest() to
find the top 3.

top_3_salespersons = df.groupby('Salesperson_ID')['Sales_Amount'].sum().nlargest(3)

---

5. Calculate the Rolling Average of Sales for Each Salesperson


Task: Calculate the 3-period rolling average of sales for each Salesperson_ID.
Solution:
Sort by date so each window runs chronologically, then group by
Salesperson_ID and apply the rolling function with a window of 3.

df = df.sort_values(['Salesperson_ID', 'Date'])
df['Rolling_Avg_Sales'] = df.groupby('Salesperson_ID')['Sales_Amount'].rolling(window=3).mean().reset_index(level=0, drop=True)

---

DataFrame 2: Employee Data

Sample DataFrame:

Emp_ID | Dept_ID | Join_Date  | Projects_Completed | Salary
-------|---------|------------|--------------------|--------
1001   | 101     | 2020-01-10 | 5                  | 50000
1002   | 102     | 2019-11-15 | 7                  | 60000
1003   | 101     | 2021-02-18 | 4                  | 55000
1004   | 103     | 2018-12-20 | 10                 | 75000
1005   | 103     | 2022-03-01 | 3                  | 48000

---
6. Find the Median Salary for Each Department
Task: Calculate the median salary for each department (Dept_ID).
Solution:
Group by Dept_ID and use the median() function on Salary.

median_salary = df.groupby('Dept_ID')['Salary'].median()

---

7. Count Employees Who Joined Before 2020


Task: Count the number of employees who joined before the year 2020.
Solution:
Convert the Join_Date to datetime and count the rows where it falls before 2020.

df['Join_Date'] = pd.to_datetime(df['Join_Date'])
employees_before_2020 = (df['Join_Date'] < '2020-01-01').sum()

---

8. Find the Employee with the Highest Number of Completed Projects


Task: Identify the employee who has completed the highest number of projects.
Solution:
Use idxmax() to find the row index with the highest Projects_Completed.

top_employee = df.loc[df['Projects_Completed'].idxmax()]

---

9. Normalize Salary Between 0 and 1


Task: Normalize the Salary column between 0 and 1.
Solution:
Apply min-max normalization to the Salary column.

df['Normalized_Salary'] = (df['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min())

---

10. Rank Employees Based on Salary


Task: Rank employees based on their salary, with the highest salary getting rank
1.
Solution:
Use the rank() function on Salary in descending order.

df['Salary_Rank'] = df['Salary'].rank(ascending=False)

---

DataFrame 3: Product Sales Data

Sample DataFrame:

Product_ID | Category_ID | Sales_Q1 | Sales_Q2 | Sales_Q3 | Sales_Q4
-----------|-------------|----------|----------|----------|----------
201        | 1           | 1000     | 1500     | 2000     | 2500
202        | 2           | 5000     | 4000     | 3500     | 3000
203        | 1           | 1200     | 1700     | 2200     | 2700
204        | 3           | 8000     | 8500     | 9000     | 9500

---

11. Calculate Total Sales for Each Product


Task: Calculate the total sales across all four quarters for each product.
Solution:
Sum the sales columns for each product.

df['Total_Sales'] = df[['Sales_Q1', 'Sales_Q2', 'Sales_Q3', 'Sales_Q4']].sum(axis=1)

---
12. Calculate Year-over-Year Sales Growth
Task: Calculate the percentage growth from Q1 to Q4 for each product.
Solution:
Compute the growth percentage using the formula (Sales_Q4 - Sales_Q1) /
Sales_Q1 * 100.

df['YoY_Growth'] = (df['Sales_Q4'] - df['Sales_Q1']) / df['Sales_Q1'] * 100

---

13. Reshape the Data to Long Format


Task: Reshape the DataFrame from wide format to long format where each row
represents one product and one quarter's sales.
Solution:
Use the pd.melt() function to unpivot the DataFrame.

df_long = pd.melt(df, id_vars=['Product_ID'],
                  value_vars=['Sales_Q1', 'Sales_Q2', 'Sales_Q3', 'Sales_Q4'],
                  var_name='Quarter', value_name='Sales')

---

14. Calculate the Mean Sales for Each Category


Task: Calculate the average sales for each product category across all four
quarters.
Solution:
Group by Category_ID and compute the mean for the sales columns.

mean_sales = df.groupby('Category_ID')[['Sales_Q1', 'Sales_Q2', 'Sales_Q3', 'Sales_Q4']].mean()

---

15. Rank Products Based on Their Total Sales


Task: Rank products based on their total sales, with the highest sales receiving
rank 1.
Solution:
Use the rank() function on Total_Sales.

df['Sales_Rank'] = df['Total_Sales'].rank(ascending=False)

---

DataFrame 4: Customer Transaction Data

Sample DataFrame:

Transaction_ID | Customer_ID | Product_ID | Date       | Amount_Spent
---------------|-------------|------------|------------|--------------
1              | 1001        | 201        | 2023-01-10 | 150
2              | 1002        | 202        | 2023-01-12 | 250
3              | 1001        | 203        | 2023-01-15 | 350
4              | 1003        | 204        | 2023-02-10 | 450
5              | 1002        | 202        | 2023-02-15 | 200

---

16. Calculate Total Spend by Each Customer


Task: Calculate the total amount spent by each customer.
Solution:
Group by Customer_ID and sum the Amount_Spent.

total_spend = df.groupby('Customer_ID')['Amount_Spent'].sum()

---

17. Find Customers Who Made Multiple Purchases


Task: Identify customers who made more than one purchase.
Solution:
Group by Customer_ID and use filter() to keep all transactions belonging to
customers with more than one row.

customers_multiple_purchases = df.groupby('Customer_ID').filter(lambda x: len(x) > 1)

---

18. Calculate Time Difference Between Consecutive Transactions


Task: Calculate the time difference in days between consecutive transactions for
each customer.
Solution:
Sort the data by Customer_ID and Date, then use the diff() function to find the
difference in days.

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(by=['Customer_ID', 'Date'])
df['Days_Between'] = df.groupby('Customer_ID')['Date'].diff().dt.days

---

19. Find the Most Popular Product Based on Sales


Task: Identify the product that has generated the highest total sales amount.
Solution:
Group by Product_ID and sum the Amount_Spent, then use idxmax() to find the
product with the highest sales.

top_product = df.groupby('Product_ID')['Amount_Spent'].sum().idxmax()

---

20. Add a Column for Running Total of Spend Per Customer


Task: Create a running total column for the amount spent by each customer.
Solution:
Group by Customer_ID and apply the cumsum() function on Amount_Spent.

df['Running_Total'] = df.groupby('Customer_ID')['Amount_Spent'].cumsum()

---

25 Verbally Asked Pandas and NumPy Interview Questions

1. What is the difference between `loc[]` and `iloc[]` in Pandas?


- `loc[]` is label-based indexing, while `iloc[]` is position-based indexing.
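A minimal illustration (the toy DataFrame is mine, not from the original):

```python
import pandas as pd

df = pd.DataFrame({'score': [10, 20, 30]}, index=['a', 'b', 'c'])

by_label = df.loc['b', 'score']  # label-based: the row labeled 'b'
by_position = df.iloc[1, 0]      # position-based: second row, first column
# both return 20
```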

2. How do you handle missing values in a Pandas DataFrame?


- You can use methods like `dropna()`, `fillna()`, or `interpolate()` to handle
missing values.
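For example, on a small Series with one missing value:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

dropped = s.dropna()            # removes the missing row
filled = s.fillna(0)            # replaces NaN with a constant
interpolated = s.interpolate()  # linear interpolation fills in 2.0
```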

3. How do you merge two DataFrames in Pandas?


- You can use `merge()`, `join()`, or `concat()` depending on the type of merging
required (e.g., inner, outer, left, right join).
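A small sketch of the difference between join types (the example tables are mine):

```python
import pandas as pd

employees = pd.DataFrame({'Emp_ID': [1, 2, 3], 'Dept_ID': [10, 20, 30]})
depts = pd.DataFrame({'Dept_ID': [10, 20], 'Dept_Name': ['HR', 'IT']})

inner = employees.merge(depts, on='Dept_ID', how='inner')  # matching keys only
left = employees.merge(depts, on='Dept_ID', how='left')    # keeps every employee
```

With `how='left'`, the employee in department 30 survives with a NaN Dept_Name; with `how='inner'`, that row is dropped.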

4. What is the purpose of `groupby()` in Pandas?


- `groupby()` groups data by one or more columns and then applies aggregation
functions to those groups.

5. How do you calculate a rolling average in Pandas?


- You can use the `rolling()` function followed by an aggregation function like
`mean()` to calculate the rolling average.

6. What is the difference between `apply()` and `map()` in Pandas?


- `apply()` is used to apply a function across a DataFrame or Series along either
axis, whereas `map()` is used for element-wise substitutions on a Series.
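For instance:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

squared = s.apply(lambda x: x ** 2)               # apply a function to each element
worded = s.map({1: 'one', 2: 'two', 3: 'three'})  # element-wise substitution from a dict
```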

7. How do you convert a column to a categorical type in Pandas?


- You can use `pd.Categorical()` or `astype('category')` to convert a column to a
categorical type.
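For example, with `astype('category')`:

```python
import pandas as pd

df = pd.DataFrame({'size': ['S', 'M', 'L', 'M']})
df['size'] = df['size'].astype('category')  # stored as integer codes + a category list
```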

8. How do you handle time-series data in Pandas?


- Time-series data can be handled using functions like `pd.to_datetime()`,
resampling with `resample()`, and using time-based indexing.

9. What is the purpose of the `pivot()` function in Pandas?


- The `pivot()` function is used to reshape data by changing rows into columns or
vice versa, based on values from other columns.
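A minimal long-to-wide example (the toy data is mine):

```python
import pandas as pd

long_df = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B'],
    'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'Sales': [100, 150, 200, 250],
})
# one row per product, one column per quarter
wide = long_df.pivot(index='Product', columns='Quarter', values='Sales')
```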

10. How do you concatenate two DataFrames in Pandas?


- You can concatenate DataFrames using the `concat()` function, either vertically
(`axis=0`) or horizontally (`axis=1`).
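For example, stacking two monthly extracts vertically:

```python
import pandas as pd

jan = pd.DataFrame({'Amount': [100, 200]})
feb = pd.DataFrame({'Amount': [300]})

stacked = pd.concat([jan, feb], axis=0, ignore_index=True)  # rows stacked vertically
```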

11. How do you handle large datasets efficiently in Pandas?


- You can optimize memory usage with `dtype` parameters, process data in
chunks, or use libraries like Dask or Modin for parallel processing.
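A small sketch of chunked processing; the in-memory CSV below stands in for a file too large to load at once:

```python
import io

import pandas as pd

# An in-memory CSV simulating a large file
csv_data = "value\n" + "\n".join(str(i) for i in range(100))

total = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=25):
    total += chunk['value'].sum()  # aggregate 25 rows at a time
```

Each chunk is an ordinary DataFrame, so any aggregation that can be combined across chunks (sums, counts, min/max) works this way.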

12. How do you remove duplicate rows in a Pandas DataFrame?


- You can remove duplicates using the `drop_duplicates()` method.
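For instance:

```python
import pandas as pd

df = pd.DataFrame({'Customer_ID': [1, 1, 2], 'Product_ID': [201, 201, 202]})
deduped = df.drop_duplicates()  # drops the repeated (1, 201) row
```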

13. What is a MultiIndex in Pandas and how do you create one?


- A MultiIndex is a hierarchical index allowing multiple levels of indexing. You
can create it using `set_index()` or by passing multiple arrays to the `index`
parameter.
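A minimal example using `set_index()` (the toy data is mine):

```python
import pandas as pd

df = pd.DataFrame({
    'Dept_ID': [101, 101, 102],
    'Emp_ID': [1, 2, 3],
    'Salary': [50000, 55000, 60000],
})
indexed = df.set_index(['Dept_ID', 'Emp_ID'])  # two-level hierarchical index
salary = indexed.loc[(101, 2), 'Salary']       # look up by the full key tuple
```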

14. What are the key differences between Pandas and NumPy?
- Pandas provides higher-level data structures (Series, DataFrame) with labeled
axes and is more suited for handling structured data, while NumPy is focused on
efficient numerical computations with arrays.

15. How do you compute the correlation between columns in Pandas?


- You can use the `corr()` function to compute the Pearson correlation coefficient
between columns in a DataFrame.
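For example, two perfectly linearly related columns correlate at 1.0:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8]})
corr_matrix = df.corr()  # y = 2x, so the Pearson correlation is exactly 1.0
```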

16. What is the purpose of the `cut()` function in Pandas?

- The `cut()` function is used to segment and sort data into discrete bins, which is
useful for converting continuous variables into categorical ones.

17. How do you resample time-series data in Pandas?


- The `resample()` function allows you to change the frequency of time-series
data, for example, converting daily data into monthly data.
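A small sketch of daily-to-monthly resampling (the sample series is mine):

```python
import pandas as pd

daily = pd.Series(
    [10, 20, 30, 40],
    index=pd.to_datetime(['2023-01-01', '2023-01-15', '2023-02-01', '2023-02-20']),
)
monthly = daily.resample('MS').sum()  # 'MS' = month-start frequency
```

The January values (10 + 20) collapse into one bin and the February values (30 + 40) into another.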

18. How do you filter rows based on a condition in Pandas?


- You can filter rows by using conditional statements like `df[df['column'] >
value]`.

19. What is the purpose of the `apply()` function in Pandas?


- The `apply()` function is used to apply a function along an axis (rows or
columns) of a DataFrame or Series.

20. How do you perform element-wise operations in NumPy?


- NumPy supports element-wise operations like addition, subtraction, and
multiplication directly between arrays of the same shape.
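For example:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

total = a + b    # element-wise addition       -> [11, 22, 33]
product = a * b  # element-wise multiplication -> [10, 40, 90]
```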

21. What is broadcasting in NumPy?


- Broadcasting is a technique that allows NumPy to perform arithmetic
operations on arrays of different shapes, automatically expanding the smaller array.
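A minimal example of a 1-D array broadcast against a 2-D array:

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])               # shape (3,)

result = matrix + row  # the row is stretched across both rows of the matrix
```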

22. How do you calculate the cumulative sum in Pandas?


- You can use the `cumsum()` function to compute the cumulative sum of a
column or Series.

23. How do you reshape a NumPy array?


- You can reshape a NumPy array using the `reshape()` method, which changes
the dimensions of an array without changing its data.
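For instance:

```python
import numpy as np

arr = np.arange(6)        # [0, 1, 2, 3, 4, 5]
grid = arr.reshape(2, 3)  # same data viewed as 2 rows x 3 columns
```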

24. What is the purpose of the `np.dot()` function in NumPy?


- The `np.dot()` function performs matrix multiplication between two arrays.
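For example, on two 2x2 matrices:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = np.dot(A, B)  # matrix product, not element-wise multiplication
```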

25. How do you find the index of the maximum value in a NumPy array?

- You can use the `np.argmax()` function to find the index of the maximum value
in a NumPy array.
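For instance:

```python
import numpy as np

arr = np.array([3, 7, 1, 7])
idx = np.argmax(arr)  # returns the FIRST index of the maximum value
```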
