Wipro Data Analyst Interview Questions

The document provides a comprehensive overview of SQL and Python concepts relevant for data analysis, including SQL joins, window functions, query optimization, and data handling techniques in Pandas. It also covers database normalization, indexing, and the creation of dashboards in Power BI, highlighting differences between Power BI and Tableau, as well as the use of DAX functions. The content is structured in a Q&A format, making it a useful resource for data analysts preparing for interviews or seeking to enhance their skills.


SQL & Database Questions

1. What are different types of SQL joins?


Joins are used to combine rows from two or more tables based on a related column
between them. The main types of joins are:

1. INNER JOIN

• Returns only matching rows between two tables.

SELECT e.employee_id, e.name, d.department_name

FROM employees e

INNER JOIN departments d ON e.department_id = d.department_id;

2. LEFT JOIN (LEFT OUTER JOIN)

• Returns all rows from the left table and matching rows from the right table.

SELECT e.employee_id, e.name, d.department_name

FROM employees e

LEFT JOIN departments d ON e.department_id = d.department_id;

3. RIGHT JOIN (RIGHT OUTER JOIN)

• Returns all rows from the right table and matching rows from the left table.

SELECT e.employee_id, e.name, d.department_name


FROM employees e

RIGHT JOIN departments d ON e.department_id = d.department_id;

4. FULL JOIN (FULL OUTER JOIN)

• Returns all rows from both tables, with NULLs where there are no matches.

SELECT e.employee_id, e.name, d.department_name

FROM employees e

FULL JOIN departments d ON e.department_id = d.department_id;

5. CROSS JOIN

• Produces the Cartesian product of both tables.

SELECT e.employee_id, e.name, d.department_name

FROM employees e

CROSS JOIN departments d;

2. How do you find duplicate records in a table?


To find duplicates, we use GROUP BY and HAVING COUNT(*) > 1.

SELECT employee_id, COUNT(*)

FROM employees

GROUP BY employee_id

HAVING COUNT(*) > 1;

To retrieve full duplicate records:

SELECT *

FROM employees e1

WHERE (SELECT COUNT(*) FROM employees e2 WHERE e1.name = e2.name) > 1;

To delete duplicates while keeping one:

DELETE FROM employees

WHERE employee_id NOT IN (


SELECT MIN(employee_id)

FROM employees

GROUP BY name

);

3. What is a window function in SQL? Give examples.


Window functions perform calculations across a set of table rows related to the
current row without collapsing the results.

1. ROW_NUMBER()

Assigns a unique row number within a partition.

SELECT employee_id, name, department_id,
ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees;

2. RANK()

Gives ranking with gaps for ties.

SELECT employee_id, name, department_id,

RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank

FROM employees;

3. DENSE_RANK()

Same as RANK() but without gaps.

SELECT employee_id, name, department_id,
DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees;

4. LAG() & LEAD()

Used to access previous or next row values.


SELECT employee_id, name, salary,

LAG(salary) OVER (ORDER BY salary DESC) AS prev_salary,

LEAD(salary) OVER (ORDER BY salary DESC) AS next_salary

FROM employees;

4. How do you optimize a slow SQL query?


Optimizing SQL queries involves:

1. Use Indexing – Index columns used in WHERE, JOIN, and ORDER BY.

CREATE INDEX idx_employee_id ON employees(employee_id);

2. Avoid SELECT * – Fetch only required columns.

SELECT employee_id, name FROM employees;

3. Use EXISTS instead of IN for subqueries

SELECT employee_id FROM employees WHERE EXISTS (
    SELECT 1 FROM departments WHERE employees.department_id = departments.department_id
);

4. Use Joins Efficiently – Prefer INNER JOIN over OUTER JOIN when possible.

5. Use Query Execution Plan – Use EXPLAIN in MySQL or EXPLAIN ANALYZE in PostgreSQL.

EXPLAIN SELECT * FROM employees WHERE salary > 50000;

6. Partition Large Tables – If working with billions of records, partitioning helps.

CREATE TABLE employees_partitioned (

employee_id INT NOT NULL,

name VARCHAR(100),

salary INT,
department_id INT

) PARTITION BY RANGE (salary) (

PARTITION p1 VALUES LESS THAN (30000),

PARTITION p2 VALUES LESS THAN (60000),

PARTITION p3 VALUES LESS THAN (100000)

);

5. Explain the difference between a CTE and a subquery.


Comparison – CTE (Common Table Expression) vs. Subquery:

• Readability – CTE: more readable and reusable; Subquery: harder to read.
• Performance – CTE: optimized for repeated use; Subquery: may execute multiple times.
• Recursion – CTE: supports recursion; Subquery: no recursion support.
• Scope – CTE: exists only within the query where it is defined; Subquery: limited to its parent query.

CTE Example

WITH TopEmployees AS (
    SELECT employee_id, name, salary FROM employees WHERE salary > 50000
)
SELECT * FROM TopEmployees;

Subquery Example

SELECT employee_id, name, salary

FROM employees

WHERE salary > (SELECT AVG(salary) FROM employees);

6. How would you design a normalized database schema?


Normalization helps reduce redundancy and improve integrity.
Normalization Steps

1. First Normal Form (1NF):

o Ensure atomicity (no repeating groups).

2. Second Normal Form (2NF):

o Remove partial dependencies (every non-key attribute must depend on the entire primary key).

3. Third Normal Form (3NF):

o Remove transitive dependencies (non-key attributes should depend only on the primary key, not on other non-key attributes).

Example Schema

Unnormalized Table:

OrderID CustomerName ProductName Price

1 Alice Laptop 1000

2 Bob Mouse 20

Normalized Schema (3NF)

1. Customers Table

CREATE TABLE Customers (

CustomerID INT PRIMARY KEY,

Name VARCHAR(100)

);

2. Orders Table

CREATE TABLE Orders (

OrderID INT PRIMARY KEY,

CustomerID INT,

FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)

);
3. Products Table

CREATE TABLE Products (

ProductID INT PRIMARY KEY,

ProductName VARCHAR(100),

Price DECIMAL(10,2)

);

7. What is the use of indexing in SQL?


Indexes improve query performance by reducing the amount of data scanned.

Types of Indexes

1. Primary Index (Clustered Index)

CREATE TABLE employees (

employee_id INT PRIMARY KEY,

name VARCHAR(100)

);

2. Unique Index

CREATE UNIQUE INDEX idx_employee_name ON employees(name);

3. Composite Index

CREATE INDEX idx_name_dept ON employees(name, department_id);

4. Full-Text Index (for searching text)

CREATE FULLTEXT INDEX idx_search ON employees(name);

Best Practices:

• Index columns used in WHERE, JOIN, and ORDER BY.

• Avoid too many indexes (affects inserts/updates).

• Use EXPLAIN to check index usage.


Python for Data Analysis
8. How do you handle missing values in a dataset using Pandas?
Missing values can cause issues in data analysis, so handling them properly is crucial.
Pandas provides multiple ways to deal with missing values.

Checking for Missing Values

import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [None, 2, 3, 4]})

print(df.isnull()) # Returns True for missing values

print(df.isnull().sum()) # Count of missing values per column

Removing Missing Values

• Remove rows with missing values:

df.dropna(inplace=True)

• Remove columns with missing values:

df.dropna(axis=1, inplace=True)

Filling Missing Values

• Fill with a specific value:

df.fillna(0, inplace=True)

• Fill with column mean/median/mode:

df['A'].fillna(df['A'].mean(), inplace=True)

df['B'].fillna(df['B'].median(), inplace=True)

df['B'].fillna(df['B'].mode()[0], inplace=True)

• Forward fill (propagate last valid value forward):

df.fillna(method='ffill', inplace=True)

• Backward fill (propagate next valid value backward):


df.fillna(method='bfill', inplace=True)

9. What is the difference between a list and a tuple in Python?


Feature List Tuple

Mutability Mutable (can be changed) Immutable (cannot be changed)

Syntax list = [1, 2, 3] tuple = (1, 2, 3)

Performance Slower (since it can be modified) Faster (fixed memory allocation)

Memory Usage Uses more memory Uses less memory

Use Case When data needs to change When data should remain constant

Example

# List (mutable)

my_list = [1, 2, 3]

my_list.append(4) # Allowed

print(my_list) # [1, 2, 3, 4]

# Tuple (immutable)

my_tuple = (1, 2, 3)

# my_tuple.append(4) # Raises an AttributeError

10. How do you remove duplicate rows from a DataFrame?


To remove duplicate rows from a DataFrame, use drop_duplicates().

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David'],
        'Age': [25, 30, 25, 40]}

df = pd.DataFrame(data)

# Remove duplicate rows

df.drop_duplicates(inplace=True)

print(df)

Removing Duplicates Based on Specific Columns

df.drop_duplicates(subset=['Name'], keep='first', inplace=True)  # Keeps first occurrence
df.drop_duplicates(subset=['Name'], keep='last', inplace=True)   # Keeps last occurrence
df.drop_duplicates(subset=['Name'], keep=False, inplace=True)    # Removes all duplicates

11. Explain the difference between apply() and map() functions in Pandas.

• apply() – Works on both DataFrames and Series; applies a function element-wise or row/column-wise. Example: df.apply(func, axis=1)
• map() – Works on Series only; applies a function element-wise. Example: df['column'].map(func)

Example of apply()

Used when applying functions to rows or columns.

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

# Apply function to each element


df['A_squared'] = df['A'].apply(lambda x: x**2)

print(df)

Example of map()

Used for transforming values in a Series.


df['B'] = df['B'].map(lambda x: x / 10)

print(df)

12. How do you perform group-by operations in Python?


The groupby() function in Pandas is used to group data and apply aggregate functions.

import pandas as pd

data = {'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],

'Salary': [50000, 60000, 55000, 65000, 70000]}

df = pd.DataFrame(data)

# Group by Department and calculate average salary

grouped_df = df.groupby('Department')['Salary'].mean()

print(grouped_df)

Other Aggregate Functions

• Count: df.groupby('Department')['Salary'].count()

• Sum: df.groupby('Department')['Salary'].sum()

• Max: df.groupby('Department')['Salary'].max()

• Min: df.groupby('Department')['Salary'].min()
Applying Multiple Aggregations

df.groupby('Department')['Salary'].agg(['mean', 'sum', 'max'])

13. How do you automate a data cleaning task using Python?


To automate data cleaning, write a function that applies standard cleaning techniques
and use it across datasets.

Example: Automating Data Cleaning

import pandas as pd

def clean_data(df):
    # Drop duplicates
    df.drop_duplicates(inplace=True)
    # Fill missing values with column mean
    df.fillna(df.mean(numeric_only=True), inplace=True)
    # Standardize column names
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
    return df

# Example usage

data = {'Name': ['Alice', 'Bob', 'Alice', None], 'Age': [25, 30, None, 40]}

df = pd.DataFrame(data)

cleaned_df = clean_data(df)
print(cleaned_df)

Automation Using a Script

Save the function in a script (data_cleaning.py) and reuse it.

import pandas as pd

from data_cleaning import clean_data

df = pd.read_csv('dirty_data.csv')

cleaned_df = clean_data(df)

cleaned_df.to_csv('cleaned_data.csv', index=False)

14. Explain lambda functions and their use in data analysis.


Lambda functions are small, anonymous functions that can be defined in a single line.

Syntax

lambda arguments: expression

Example: Using Lambda in Pandas

import pandas as pd

df = pd.DataFrame({'Numbers': [1, 2, 3, 4]})

# Apply lambda function to square numbers

df['Squared'] = df['Numbers'].apply(lambda x: x**2)

print(df)

Using Lambda in Sorting

data = [('Alice', 25), ('Bob', 30), ('Charlie', 20)]

data.sort(key=lambda x: x[1])

print(data) # Sorted by age


Using Lambda in Filtering

filtered_data = list(filter(lambda x: x[1] > 25, data))

print(filtered_data)

Why Use Lambda?

• Short and concise function definition.

• Useful for one-time operations in apply(), map(), filter(), and sort().

Data Visualization & BI Tools


15. How do you create a dashboard in Power BI?
Creating a dashboard in Power BI involves several steps:

Step 1: Load Data into Power BI

• Open Power BI Desktop.

• Click Home > Get Data and import data (Excel, SQL, CSV, etc.).

• Click Transform Data (Power Query) to clean and shape data.

Step 2: Create Data Model

• Define relationships between tables using the Model View.

• Create calculated columns, measures, or tables using DAX.

Step 3: Build Visualizations

• Use the Report View to add charts, tables, KPIs, and slicers.

• Common visuals: Bar charts, Line charts, Pie charts, Cards, Maps.

Step 4: Use DAX for Custom Calculations

• Example DAX measure to calculate Total Sales:

TotalSales = SUM(Sales[Amount])

Step 5: Create a Dashboard (in Power BI Service)


• Publish the report to Power BI Service.

• Go to Dashboards, click "Pin Visuals" to add key insights.

• Arrange and customize the layout.

Step 6: Enable Interactivity & Filters

• Add slicers (dropdowns, date filters).

• Use Drill-through pages to navigate deeper insights.

Step 7: Share & Schedule Refresh

• Share the dashboard with stakeholders.

• Set up scheduled refresh for automatic updates.

16. What is the difference between Power BI and Tableau?


• Ease of Use – Power BI: easier for beginners, integrated with the Microsoft ecosystem; Tableau: more complex but highly customizable.
• Cost – Power BI: more affordable (Power BI Pro ~$10/month); Tableau: more expensive (Tableau Creator ~$70/month).
• Data Connectivity – Power BI: seamless with Microsoft tools (Excel, Azure, SQL); Tableau: more external connectors.
• Performance – Power BI: efficient for small to mid-sized datasets; Tableau: better performance for large datasets.
• Calculations – Power BI: uses DAX; Tableau: uses Tableau Calculations & LOD Expressions.
• Deployment – Power BI: cloud-based (Power BI Service) & on-premises; Tableau: on-premises (Tableau Server) & cloud (Tableau Online).
• Visualization Options – Power BI: good, but limited customization; Tableau: more flexibility & aesthetics.
Power BI is great for Microsoft-based companies & affordability.
Tableau is preferred for advanced data visualization & interactivity.

17. Explain DAX functions and their use in Power BI.


DAX (Data Analysis Expressions) is a formula language used in Power BI to create
calculated columns, measures, and tables.

Common Types of DAX Functions

Aggregate Functions

TotalSales = SUM(Sales[Amount])

AverageSales = AVERAGE(Sales[Amount])

Filter Functions

FilteredSales = CALCULATE(SUM(Sales[Amount]), Sales[Region] = "East")

Time Intelligence Functions

SalesLastYear = CALCULATE(SUM(Sales[Amount]), SAMEPERIODLASTYEAR(Sales[Date]))

Logical Functions

SalesCategory = IF(Sales[Amount] > 5000, "High", "Low")

Ranking Functions

SalesRank = RANKX(ALL(Sales[Product]), SUM(Sales[Amount]), , DESC, DENSE)

DAX is essential for advanced calculations, custom measures, and enhanced data
analysis in Power BI.

18. How do you handle real-time data updates in Power BI?


Power BI allows real-time data streaming for live dashboards.

Methods to Handle Real-Time Data

1. DirectQuery Mode
• Connect to live data sources (SQL Server, Snowflake, Azure).

• No data import; queries are run live.

• Suitable for real-time dashboards but may affect performance.

2. Streaming Dataset (Power BI Service)

• Push data from external sources via Power BI REST API.

• Use for IoT, sensor data, or real-time logs.

3. Auto Page Refresh (For DirectQuery)

• Go to Report Page > Format Pane > Page Refresh.

• Set refresh interval (e.g., every 5 seconds).

4. Scheduled Refresh for Imported Data

• In Power BI Service, go to Dataset Settings > Schedule Refresh.

• Use Gateways for on-premise SQL data sources.

Best approach depends on whether the data source supports live connections.

19. What are the best practices for report performance optimization in Power BI?

Best practices to optimize Power BI performance:

1. Use Aggregations & Reduce Data Volume

• Avoid importing unnecessary columns & rows.

• Use SUMMARIZE() or GROUP BY to pre-aggregate data.

2. Optimize DAX Queries

• Prefer SUMX() over SUM() when working with row context.

• Avoid using CALCULATE() unnecessarily.

• Use variables to store intermediate calculations.

3. Optimize Relationships & Model Design


• Use Star Schema (fact and dimension tables) instead of Snowflake Schema.

• Reduce bidirectional relationships.

4. Enable Query Reduction Settings

• In Options > Query Reduction, enable "Reduce the number of queries sent to
the server".

5. Optimize Visuals

• Limit the number of visuals per page.

• Disable unnecessary interactivity & cross-filtering.

6. Use Incremental Refresh for Large Datasets

• Power BI Premium/Pro allows incremental data refresh, avoiding a full dataset refresh.

A combination of better data modeling, optimized DAX, and efficient data storage
significantly improves Power BI performance.

20. How do you implement row-level security (RLS) in Power BI?


Row-Level Security (RLS) restricts data visibility based on user roles.

Steps to Implement RLS:

Step 1: Define Roles in Power BI Desktop

1. Go to Modeling > Manage Roles.

2. Click Create Role and define a filter using DAX.

[Region] = "East" -- Users with this role will only see East region data

3. Save and assign roles.

Step 2: Assign Users in Power BI Service

1. Publish the report to Power BI Service.

2. Go to Dataset > Security.

3. Assign users or security groups to the created roles.


Step 3: Test RLS

• Use View as Role in Power BI Desktop to validate security rules.

Dynamic RLS (User-Based Filtering)

Instead of static values, use USERNAME() or USERPRINCIPALNAME() to filter data dynamically.

[Email] = USERPRINCIPALNAME()

Dynamic RLS ensures that each user only sees their permitted data.

Final Thoughts

Power BI is powerful for creating interactive dashboards, managing large datasets, and implementing security controls.
DAX functions help in advanced calculations and time intelligence analysis.
Best practices like optimizing models, reducing visuals, and using DirectQuery
improve performance.
Real-time data updates require streaming datasets, auto-refresh, or DirectQuery.

Statistical & Analytical Thinking

21. What are the key differences between descriptive and inferential statistics?

• Definition – Descriptive: summarizes and organizes data; Inferential: draws conclusions from a sample to a population.
• Purpose – Descriptive: provides insights into dataset characteristics; Inferential: makes predictions or inferences about a larger population.
• Techniques – Descriptive: measures of central tendency (mean, median, mode), dispersion (variance, standard deviation), and visualization (histograms, box plots); Inferential: hypothesis testing, confidence intervals, regression analysis.
• Example – Descriptive: "The average height of students in a class is 5.6 ft."; Inferential: "Based on a sample, the average height of all students in a school is likely 5.6 ft ± 0.2 ft."

Descriptive statistics focus on summarizing data, while inferential statistics generalize findings beyond the sample data.
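
To make the contrast concrete, here is a small Python sketch; the height values are made up for illustration. The summary statistics describe the sample itself, while the confidence interval makes an inference about the wider population.

import numpy as np
from scipy import stats

# Hypothetical sample of student heights in feet (illustrative values only)
heights = np.array([5.4, 5.6, 5.7, 5.5, 5.8, 5.6, 5.9, 5.3])

# Descriptive statistics: summarize the observed sample
print("Mean:", heights.mean())
print("Median:", np.median(heights))
print("Std dev:", heights.std(ddof=1))

# Inferential statistics: 95% confidence interval for the population mean
ci = stats.t.interval(0.95, len(heights) - 1, loc=heights.mean(), scale=stats.sem(heights))
print("95% CI for population mean:", ci)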

22. How do you detect and handle outliers in a dataset?


Step 1: Detecting Outliers

Using Boxplot (IQR Method)

• Outliers are values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.

• Python Example:

import pandas as pd

import numpy as np

data = pd.DataFrame({'values': [10, 12, 13, 15, 200, 17, 19, 21]})

Q1 = data['values'].quantile(0.25)

Q3 = data['values'].quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR

upper_bound = Q3 + 1.5 * IQR

outliers = data[(data['values'] < lower_bound) | (data['values'] > upper_bound)]


print(outliers)

Using Z-Score Method

• Any data point with a Z-score > 3 or < -3 is an outlier.

• Python Example:

from scipy import stats

data['zscore'] = np.abs(stats.zscore(data['values']))

outliers = data[data['zscore'] > 3]

print(outliers)

Step 2: Handling Outliers

Option 1: Remove Outliers (if data is erroneous or irrelevant).

data_cleaned = data[(data['values'] >= lower_bound) & (data['values'] <= upper_bound)]

Option 2: Transform Data (Log/Square Root Transformation).

data['log_values'] = np.log(data['values'])

Option 3: Replace Outliers with Median or Mean.

data['values'] = np.where(data['values'] > upper_bound, data['values'].median(), data['values'])

Choosing a method depends on whether the outlier is a genuine extreme value or an error.

23. What is correlation? How do you measure it?


Correlation measures the strength and direction of the relationship between two
variables.

Types of Correlation

• Positive Correlation (r > 0): As X increases, Y increases (e.g., more study hours →
higher grades).
• Negative Correlation (r < 0): As X increases, Y decreases (e.g., more time on
social media → lower productivity).

• No Correlation (r ≈ 0): No relationship between variables.

Methods to Measure Correlation

Pearson Correlation (for linear relationships)

• Python Example:

import pandas as pd

df = pd.DataFrame({'X': [10, 20, 30, 40], 'Y': [15, 25, 35, 50]})

correlation = df['X'].corr(df['Y'])

print("Pearson Correlation:", correlation)

Spearman Rank Correlation (for non-linear data)

spearman_corr = df.corr(method='spearman')

Kendall’s Tau (for ordinal data)

kendall_corr = df.corr(method='kendall')

Pearson's r is most common, but Spearman/Kendall are better for non-linear relationships.

24. Explain hypothesis testing and its significance in data analysis.

Hypothesis Testing is a statistical method to determine if a sample result is
significant or just due to random chance.

Steps in Hypothesis Testing:

1. State Null & Alternative Hypothesis


• Null Hypothesis (H₀): No effect or difference.

• Alternative Hypothesis (H₁): There is an effect or difference.

• Example: Does a new marketing strategy increase sales?

o H₀: The new strategy has no impact on sales.

o H₁: The new strategy increases sales.

2. Select Significance Level (α)

• Common values: 0.05 (5%) or 0.01 (1%).

• α = 0.05 → 5% probability of wrongly rejecting H₀.

3. Choose a Statistical Test

• t-test: Comparing means of two groups.

• Chi-Square test: Categorical variables association.

• ANOVA: Comparing more than two group means.

4. Calculate p-value

• If p < α, reject H₀ (significant).

• If p > α, fail to reject H₀ (not significant).

Example (t-test in Python):

from scipy.stats import ttest_ind

group1 = [100, 120, 130, 140, 150]

group2 = [90, 95, 110, 115, 120]

t_stat, p_value = ttest_ind(group1, group2)

print("p-value:", p_value)

If p < 0.05, the two groups are significantly different.
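
For the other tests listed in step 3, scipy offers similar helpers. A hedged sketch with made-up data (the contingency table and regional sales figures below are illustrative assumptions):

import numpy as np
from scipy.stats import chi2_contingency, f_oneway

# Chi-Square test: association between two categorical variables
# (hypothetical 2x2 table, e.g., customer segment vs. product preference)
observed = np.array([[30, 10], [20, 40]])
chi2, p_chi, dof, expected = chi2_contingency(observed)
print("Chi-square p-value:", p_chi)

# One-way ANOVA: compare means of more than two groups (hypothetical sales by region)
north, south, east = [100, 110, 105, 120], [90, 95, 100, 85], [130, 125, 140, 135]
f_stat, p_anova = f_oneway(north, south, east)
print("ANOVA p-value:", p_anova)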

Significance

• Used in A/B testing, medical trials, market research.

• Helps make data-driven decisions instead of assumptions.


25. How would you analyze sales data to find patterns?
Step 1: Data Understanding

• Explore data columns (sales amount, region, product category, time, etc.).

• Handle missing values and ensure consistency.

Step 2: Exploratory Data Analysis (EDA)

Trend Analysis (Find seasonal sales patterns)

import matplotlib.pyplot as plt

df['Month'] = pd.to_datetime(df['Date']).dt.month

df.groupby('Month')['Sales'].sum().plot(kind='line', marker='o')

plt.show()

Helps identify peak sales months.

Product-wise Sales Contribution

df.groupby('Product')['Sales'].sum().plot(kind='bar')

Find best-selling products.

Region-wise Sales Comparison

import seaborn as sns

sns.boxplot(x='Region', y='Sales', data=df)

Identify high-performing regions.

Step 3: Identify Correlations

• Does advertising spend impact sales?

df[['Sales', 'Ad_Spend']].corr()

Step 4: Predict Future Sales (Time Series Forecasting)

from statsmodels.tsa.holtwinters import ExponentialSmoothing

model = ExponentialSmoothing(df['Sales']).fit()
df['Forecast'] = model.predict(start=0, end=len(df) - 1)

Insights from analysis can guide promotions, inventory management, and business strategy.

Business & Scenario-Based Questions

26. Suppose you're given raw customer transaction data. How would you analyze it?

Step 1: Understand the Data

• Identify available columns (e.g., Customer_ID, Transaction_Date, Product_Category, Amount_Spent).

• Check for missing values, duplicates, and data inconsistencies.

Step 2: Clean & Prepare the Data

• Handle missing values (fillna() for missing data, drop_duplicates() for duplicate
transactions).

• Convert date columns to datetime format for time-based analysis.

Step 3: Exploratory Data Analysis (EDA)


1. Sales Trends Over Time

df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])

df.set_index('Transaction_Date')['Amount_Spent'].resample('M').sum().plot(kind='line')

Identifies seasonal trends and peak transaction months.

2. Customer Segmentation (RFM Analysis)

• Recency (Days since last purchase)

• Frequency (Number of transactions)

• Monetary Value (Total spending)

import datetime as dt

today = dt.datetime.today()

# Aggregate to one row per customer
rfm = df.groupby('Customer_ID').agg(
    Recency=('Transaction_Date', lambda x: (today - x.max()).days),
    Frequency=('Transaction_Date', 'count'),
    Monetary=('Amount_Spent', 'sum')
)

3. Product Performance

df.groupby('Product_Category')['Amount_Spent'].sum().plot(kind='bar')

Helps determine the best-selling categories.

Step 4: Business Insights

• Identify high-value customers using RFM.

• Find low-performing products to optimize inventory.

• Suggest personalized promotions based on purchase patterns.

27. How do you present insights from data to non-technical stakeholders?

1. Know Your Audience

• C-suite executives → Focus on high-level KPIs.

• Marketing → Show customer behavior insights.

• Operations → Highlight efficiency metrics.

2. Use Visual Storytelling

• Power BI/Tableau Dashboards for trends.

• Graphs & Charts instead of raw numbers.

• Comparisons (e.g., “Sales increased by 20% YoY”).

3. Keep It Simple & Actionable

• Avoid technical jargon (e.g., say "sales increased" instead of "revenue has an
upward linear regression trend").
• Use bullet points & structured slides.

4. Focus on Business Impact

• Instead of “Customer churn rate is 10%,” say “We lost 1,000 customers last
month, costing $50K in revenue”.

• Suggest next steps: “A loyalty program can reduce churn by 15%.”

5. Use Power BI for Interactive Reports

# Sample DAX for Dynamic Sales KPI

Sales_Growth = DIVIDE([Total Sales] - [Previous Sales], [Previous Sales])

The goal is to drive decisions, not just share data.

28. What key metrics would you track for an e-commerce company?

1. Sales & Revenue Metrics

• Total Revenue → SUM(Sales_Amount)

• Average Order Value (AOV) → Revenue / Total Orders

• Sales Conversion Rate → Orders / Website Visitors

2. Customer Behavior Metrics

• Customer Retention Rate → Returning Customers / Total Customers

• Customer Lifetime Value (CLV) → (AOV × Purchase Frequency) × Customer Lifespan

• Churn Rate → Lost Customers / Total Customers

3. Marketing Performance Metrics

• Customer Acquisition Cost (CAC) → Total Marketing Spend / New Customers

• Return on Ad Spend (ROAS) → Revenue from Ads / Ad Spend

4. Operational Metrics

• Cart Abandonment Rate → Abandoned Carts / Initiated Checkouts


• Fulfillment Time → time from order placed to order delivered

These metrics help optimize pricing, marketing, and inventory management.
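
A rough pandas sketch of how a few of these metrics could be computed from transaction data; the orders table, visitor count, and marketing spend below are made-up placeholders:

import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'customer_id': ['A', 'B', 'A', 'C', 'B'],
    'amount': [120, 80, 200, 50, 150]
})
website_visitors = 1000   # assumed figure
marketing_spend = 500     # assumed figure

total_revenue = orders['amount'].sum()
aov = total_revenue / orders['order_id'].nunique()                 # Average Order Value
conversion_rate = orders['order_id'].nunique() / website_visitors  # Sales Conversion Rate
cac = marketing_spend / orders['customer_id'].nunique()            # CAC (treating all customers as new here)
repeat_rate = (orders['customer_id'].value_counts() > 1).mean()    # share of repeat customers

print(total_revenue, round(aov, 2), f"{conversion_rate:.1%}", round(cac, 2), f"{repeat_rate:.1%}")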

29. How would you automate a weekly sales report in Power BI?
1. Connect to Data Source

• Use SQL Server, Excel, API, or Google Sheets.

• Schedule a data refresh in Power BI Service.

2. Create Key Sales Metrics

• Total Weekly Sales

Weekly_Sales = SUMX(FILTER(Sales, Sales[Date] >= TODAY() - 7), Sales[Amount])

• Sales Growth Rate

Sales_Growth = DIVIDE([This Week Sales] - [Last Week Sales], [Last Week Sales])

3. Design a Power BI Dashboard

• Sales Trend (Line Chart)

• Top Selling Products (Bar Chart)

• Region-Wise Sales (Map Visual)

4. Automate Report Delivery

• Publish the report to Power BI Service.

• Set up a Power BI Data Refresh Schedule (e.g., every Monday 8 AM).

• Use Power BI Subscriptions to email reports to stakeholders.

This ensures timely reporting without manual intervention.

30. What steps would you take to validate data accuracy before creating a report?
1. Check for Missing & Duplicate Data
df.isnull().sum() # Check for missing values

df.duplicated().sum() # Check for duplicates

2. Ensure Data Consistency

• Standardize categories (upper(), strip()).

• Verify date formats.
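
A minimal sketch of these consistency checks (the Region and Date column names are hypothetical):

import pandas as pd

df['Region'] = df['Region'].str.strip().str.upper()        # standardize category values
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')   # enforce a single date format
print(df['Date'].isnull().sum())                           # count of dates that failed to parse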

3. Cross-Check Data with Source

• Compare Power BI report totals with SQL queries:

SELECT SUM(Sales_Amount) FROM Sales_Table WHERE Date BETWEEN '2024-01-01' AND '2024-01-07';

• Validate sample records manually.

4. Look for Anomalies & Outliers

df.describe() # Check for unusually high/low values

5. Conduct Reconciliation Testing

• Compare Power BI visuals with backend data.

• Check for negative or unrealistic sales figures.

6. Get Business Validation

• Ask finance/sales teams to confirm key numbers before publishing.

Data validation prevents incorrect insights that could lead to poor business
decisions.
