Pandas Interview Questions & Answers
Data Insight
Sample Data for Analysis
In [1]: import pandas as pd
import numpy as np
# Set seed for reproducibility
np.random.seed(42)
# Create a range of dates
dates = pd.date_range(start="2023-01-01", end="2025-01-01", freq='D')
n = len(dates)
# Generate sample data
df = pd.DataFrame({
'date': dates,
'sales': np.random.randint(100, 1000, size=n),
'product': np.random.choice(['Product A', 'Product B', 'Product C'], size=n),
'department': np.random.choice(['HR', 'Sales', 'IT', 'Operations'], size=n),
'processing_time': np.random.normal(loc=5, scale=2, size=n).clip(1, 15),
'customer_id': np.random.randint(1, 500, size=n),
'initiative': np.random.choice(['None', 'New Campaign'], size=n, p=[0.8, 0.2]),
'revenue': np.random.uniform(200, 1000, size=n),
'cost': np.random.uniform(100, 500, size=n)
})
# Save the dataset as CSV
df.to_csv("sample_business_data.csv", index=False)
print("Sample dataset saved as 'sample_business_data.csv'")
Sample dataset saved as 'sample_business_data.csv'
1. How do you identify trends in a dataset using Pandas?
In [5]: import pandas as pd
# Filepath to your CSV
filepath = r'D:\sales_data.csv'
# Step 1: Read the file
df = pd.read_csv(filepath)
# Step 2: Convert 'date' to datetime (unparseable values become NaT)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# Step 3: Set the datetime column as index
df.set_index('date', inplace=True)
# Step 4: Confirm the index type
print(type(df.index)) # Should show DatetimeIndex
# Step 5: Now you can safely resample ('M' = month-end; pandas 2.2+ prefers the alias 'ME')
monthly_trend = df['sales'].resample('M').mean()
print(monthly_trend.tail())
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
date
2024-09-30 593.333333
2024-10-31 564.833333
2024-11-30 400.750000
2024-12-31 354.916667
2025-01-31 495.000000
Name: sales, dtype: float64
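Monthly averages can still look jumpy, so a rolling mean and a month-over-month percentage change are two common follow-ups. The sketch below is a minimal illustration on synthetic data; the series, the 30-day window, and the variable names are assumptions, not part of the original notebook. It uses the 'MS' (month-start) alias, which labels each month by its first day.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's data: 120 days of daily sales
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=120, freq="D")
sales = pd.Series(rng.integers(100, 1000, size=120), index=dates, name="sales")

# 30-day rolling mean: smooths day-to-day noise (first 29 values are NaN)
rolling_30d = sales.rolling(window=30).mean()

# Monthly mean, labelled by month start, and its month-over-month % change
monthly_mean = sales.resample("MS").mean()
mom_change = monthly_mean.pct_change() * 100

print(rolling_30d.dropna().tail(3))
print(mom_change.round(1))
```

A sustained rise or fall in the rolling mean is usually a more reliable trend signal than a single month's average.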
2. How do you identify correlations between columns in a DataFrame?
In [7]: # Select only numeric columns
numeric_df = df.select_dtypes(include='number')
# Now compute the correlation matrix
correlation_matrix = numeric_df.corr()
print(correlation_matrix)
                    sales  processing_time  customer_id    revenue       cost
sales            1.000000        -0.014953    -0.047259   0.006676  -0.005750
processing_time -0.014953         1.000000    -0.013571   0.015301  -0.046653
customer_id     -0.047259        -0.013571     1.000000  -0.001218   0.017012
revenue          0.006676         0.015301    -0.001218   1.000000   0.020078
cost            -0.005750        -0.046653     0.017012   0.020078   1.000000
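A full matrix repeats every pair twice and includes the trivial diagonal, so it can help to flatten it into a ranked list of unique pairs. The following is a small sketch on a made-up three-column frame (the `toy` data and variable names are hypothetical); it keeps only the upper triangle and sorts by absolute correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'revenue' is built to track 'sales' closely
toy = pd.DataFrame({
    "sales": [10, 20, 30, 40, 50],
    "revenue": [12, 19, 33, 41, 48],
    "cost": [5, 3, 6, 2, 4],
})

corr = toy.corr()

# Boolean mask for the upper triangle; k=1 excludes the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One row per unique column pair, strongest relationships first
pairs = corr.where(mask).stack().dropna()
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Note that `corr()` computes Pearson correlation by default, which only captures linear relationships.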
3. How do you create a data story using Pandas and data visualization?
In [9]: import matplotlib.pyplot as plt
monthly_sales = df['sales'].resample('M').sum()
plt.figure(figsize=(10, 5))
plt.plot(monthly_sales, marker='o')
plt.title("Monthly Sales Trend")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.grid(True)
plt.show()
4. How do you communicate complex data insights to non-technical stakeholders?
To communicate complex data insights to non-technical stakeholders, I:
1. Focus on the “So What?”
I lead with what the data means for the business rather than just presenting the numbers.
2. Use Clear Visuals
I use simple charts (such as bar charts or trend lines) so the insight is easy to grasp at a glance.
3. Avoid Technical Jargon
I explain findings in plain language, for example “sales increased by 15% after the campaign” instead of quoting statistical terms.
4. Tell a Story
I structure the insight as a story: the business problem first, then what the data shows, and finally a recommended action.
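As one way to put the plain-language point into practice, a small helper can turn a before/after comparison into a single sentence for a slide or email. This is only a sketch; `describe_change` is a hypothetical helper, not a Pandas function, and the numbers are invented.

```python
def describe_change(metric: str, before: float, after: float) -> str:
    """Return one plain-language sentence comparing two values of a metric."""
    if before == 0:
        # Percent change is undefined from a zero baseline; report raw values
        return f"{metric} went from 0 to {after:,.0f}."
    pct = (after - before) / before * 100
    if pct == 0:
        return f"{metric} stayed flat at {after:,.0f}."
    direction = "increased" if pct > 0 else "decreased"
    return f"{metric} {direction} by {abs(pct):.0f}% (from {before:,.0f} to {after:,.0f})."


message = describe_change("Average daily sales", 200, 230)
print(message)  # Average daily sales increased by 15% (from 200 to 230).
```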
5. How do you use Pandas to support business decision-making?
In [10]: # Example: Which product has the highest average sales?
avg_sales_by_product = df.groupby('product')['sales'].mean().sort_values(ascending=False)
print(avg_sales_by_product)
product
Product B 577.234310
Product C 560.983333
Product A 556.426877
Name: sales, dtype: float64
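An average alone can mislead when groups differ in size or spread, so a decision usually benefits from a few metrics side by side. Below is a minimal sketch using named aggregation on invented data; the `toy` frame and the output column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: product C has a high mean but only one order
toy = pd.DataFrame({
    "product": ["A", "A", "B", "B", "B", "C"],
    "sales": [100, 300, 200, 400, 600, 500],
})

# Named aggregation: each keyword becomes an output column
summary = (
    toy.groupby("product")["sales"]
       .agg(mean_sales="mean", median_sales="median", n_orders="count")
       .sort_values("mean_sales", ascending=False)
)
print(summary)
```

Here product C ranks first on mean sales, but the `n_orders` column shows that ranking rests on a single order.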
6. How do you use Pandas to identify areas for process improvement?
In [11]: # Example: Find departments with longest average processing times
avg_processing_time = df.groupby('department')['processing_time'].mean().sort_values(ascending=False)
print(avg_processing_time)
department
HR 5.167334
Operations 5.048581
Sales 4.934299
IT 4.892930
Name: processing_time, dtype: float64
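Averages this close together say little about where the slow cases are. Looking at the tail of each distribution is one possible refinement, sketched below on invented data. The `LIMIT` threshold, the helper names, and the `toy` frame are all hypothetical; the source does not state a target or a unit for processing time.

```python
import pandas as pd

# Hypothetical toy data: both departments have the same mean (7.0)
toy = pd.DataFrame({
    "department": ["HR", "HR", "HR", "HR", "IT", "IT", "IT", "IT"],
    "processing_time": [4.0, 5.0, 6.0, 13.0, 6.0, 7.0, 7.0, 8.0],
})

LIMIT = 10.0  # assumed service-level threshold, same unit as processing_time


def p90(s: pd.Series) -> float:
    """90th percentile: the value 90% of cases fall at or below."""
    return s.quantile(0.90)


def share_over_limit(s: pd.Series) -> float:
    """Fraction of cases slower than the assumed threshold."""
    return (s > LIMIT).mean()


report = toy.groupby("department")["processing_time"].agg(
    avg="mean", p90=p90, share_over_limit=share_over_limit
)
print(report)
```

In this toy example the means are identical, yet only HR has a long tail, which is where an improvement effort would likely start.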
7. How do you use Pandas to measure the effectiveness of a business strategy?
In [13]: # Compare average sales before and after 2024-01-01 (treated here as the campaign start)
pre_campaign = df[df.index < '2024-01-01']['sales'].mean()
post_campaign = df[df.index >= '2024-01-01']['sales'].mean()
effectiveness = post_campaign - pre_campaign
print(f"Change in average sales: {effectiveness}")
Change in average sales: -25.73515325670496
In [14]: # Same comparison with 'date' restored as a regular column
df.reset_index(inplace=True)
pre_campaign = df[df['date'] < '2024-01-01']['sales'].mean()
post_campaign = df[df['date'] >= '2024-01-01']['sales'].mean()
effectiveness = post_campaign - pre_campaign
print(f"Change in average sales: {effectiveness}")
Change in average sales: -25.73515325670496
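A raw difference in means does not show whether the gap is larger than ordinary noise. One simple check, offered here only as a sketch, is a permutation test: shuffle the before/after labels many times and see how often a gap at least as large appears by chance. The arrays below are synthetic stand-ins for the two periods, and `n_perm` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical daily sales for the two periods (drawn from the same range)
pre = rng.integers(100, 1000, size=365)
post = rng.integers(100, 1000, size=366)

observed = post.mean() - pre.mean()

# Shuffle the pooled values and recompute the gap under "no real effect"
pooled = np.concatenate([pre, post])
n_perm = 2000
diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    diffs[i] = shuffled[len(pre):].mean() - shuffled[:len(pre)].mean()

# Two-sided p-value: share of shuffles with a gap at least as extreme
p_value = (np.abs(diffs) >= abs(observed)).mean()
print(f"Observed change: {observed:.2f}, permutation p-value: {p_value:.3f}")
```

A large p-value suggests the observed change is consistent with random variation rather than with an effect of the strategy.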
8. How do you use Pandas to identify trends and patterns in customer behavior?
In [15]: # Example: Frequency of purchases per customer
purchase_freq = df.groupby('customer_id').size().sort_values(ascending=False)
print(purchase_freq.head())
customer_id
147 7
424 6
369 6
431 5
41 5
dtype: int64
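Purchase counts can be extended with spend and recency to give a fuller picture of each customer. The sketch below works on a tiny invented table; the column names mirror the sample dataset, while `per_customer`, `as_of`, and `repeat_share` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical toy transactions
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 3, 3],
    "date": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-03-01",
        "2024-01-20", "2024-02-01", "2024-03-15",
    ]),
    "sales": [100, 150, 200, 300, 50, 75],
})

# One row per customer: frequency, total spend, and most recent purchase
per_customer = toy.groupby("customer_id").agg(
    n_purchases=("sales", "size"),
    total_sales=("sales", "sum"),
    last_purchase=("date", "max"),
)

# Recency measured against the latest date in the data
as_of = toy["date"].max()
per_customer["days_since_last"] = (as_of - per_customer["last_purchase"]).dt.days

# Share of customers who bought more than once
repeat_share = (per_customer["n_purchases"] > 1).mean()
print(per_customer)
print(f"Repeat customers: {repeat_share:.0%}")
```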
9. How do you use Pandas to create a data-driven business case?
In [16]: # Example: Total sales per product (the 'sales' column stands in for revenue here)
revenue = df.groupby('product')['sales'].sum().sort_values(ascending=False)
print(revenue)
product
Product A 140776
Product B 137959
Product C 134636
Name: sales, dtype: int64
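Totals alone rarely make a business case; margin and share of the whole usually matter too. The following sketch assumes a frame with `revenue` and `cost` columns like the sample dataset, but the values and derived column names are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: both products bring in the same revenue
toy = pd.DataFrame({
    "product": ["A", "A", "B"],
    "revenue": [100.0, 200.0, 300.0],
    "cost": [50.0, 100.0, 240.0],
})

by_product = toy.groupby("product")[["revenue", "cost"]].sum()

# Profit, margin, and each product's share of total revenue
by_product["profit"] = by_product["revenue"] - by_product["cost"]
by_product["margin_pct"] = by_product["profit"] / by_product["revenue"] * 100
by_product["revenue_share_pct"] = (
    by_product["revenue"] / by_product["revenue"].sum() * 100
)
print(by_product.round(1))
```

In this example both products account for half of revenue, but A keeps far more of it as profit, which changes the recommendation.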
10. How do you use Pandas to measure the return on investment (ROI) of a business initiative?
In [17]: # Example ROI calculation
total_gain = df[df['initiative'] == 'New Campaign']['revenue'].sum()
total_cost = df[df['initiative'] == 'New Campaign']['cost'].sum()
roi = (total_gain - total_cost) / total_cost * 100
print(f"ROI: {roi:.2f}%")
ROI: 92.30%
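An ROI figure is easier to interpret next to a baseline. The sketch below computes ROI for every initiative group in one pass; the data is invented. It labels the no-initiative rows 'Baseline' rather than 'None' as an assumption-driven precaution, since `read_csv` may treat the literal string 'None' as a missing value depending on the pandas version, and `groupby` drops missing keys by default.

```python
import pandas as pd

# Hypothetical toy data; 'Baseline' marks rows with no initiative
toy = pd.DataFrame({
    "initiative": ["Baseline", "Baseline", "New Campaign", "New Campaign"],
    "revenue": [300.0, 500.0, 700.0, 500.0],
    "cost": [200.0, 200.0, 250.0, 250.0],
})

# Sum revenue and cost per group, then apply the same ROI formula to each
totals = toy.groupby("initiative")[["revenue", "cost"]].sum()
totals["roi_pct"] = (totals["revenue"] - totals["cost"]) / totals["cost"] * 100
print(totals)

# Difference in ROI between the campaign and the baseline, in percentage points
uplift = totals.loc["New Campaign", "roi_pct"] - totals.loc["Baseline", "roi_pct"]
print(f"ROI uplift vs. baseline: {uplift:.1f} percentage points")
```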