Wipro Data Analyst Interview Questions

The document provides a comprehensive overview of SQL and Python concepts relevant for data analysis, including SQL joins, window functions, query optimization, and data handling techniques in Pandas. It also covers database normalization, indexing, and the creation of dashboards in Power BI, highlighting differences between Power BI and Tableau, as well as the use of DAX functions. The content is structured in a Q&A format, making it a useful resource for data analysts preparing for interviews or seeking to enhance their skills.


SQL & Database Questions

1. What are different types of SQL joins?


Joins are used to combine rows from two or more tables based on a related column
between them. The main types of joins are:

1. INNER JOIN

• Returns only matching rows between two tables.

SELECT e.employee_id, e.name, d.department_name

FROM employees e

INNER JOIN departments d ON e.department_id = d.department_id;

2. LEFT JOIN (LEFT OUTER JOIN)

• Returns all rows from the left table and matching rows from the right table.

SELECT e.employee_id, e.name, d.department_name

FROM employees e

LEFT JOIN departments d ON e.department_id = d.department_id;

3. RIGHT JOIN (RIGHT OUTER JOIN)

• Returns all rows from the right table and matching rows from the left table.

SELECT e.employee_id, e.name, d.department_name


FROM employees e

RIGHT JOIN departments d ON e.department_id = d.department_id;

4. FULL JOIN (FULL OUTER JOIN)

• Returns all rows from both tables, with NULLs where there are no matches.

SELECT e.employee_id, e.name, d.department_name

FROM employees e

FULL JOIN departments d ON e.department_id = d.department_id;

5. CROSS JOIN

• Produces the Cartesian product of both tables.

SELECT e.employee_id, e.name, d.department_name

FROM employees e

CROSS JOIN departments d;

2. How do you find duplicate records in a table?


To find duplicates, we use GROUP BY and HAVING COUNT(*) > 1.

SELECT employee_id, COUNT(*)

FROM employees

GROUP BY employee_id

HAVING COUNT(*) > 1;

To retrieve full duplicate records:

SELECT *

FROM employees e1

WHERE (SELECT COUNT(*) FROM employees e2 WHERE e1.name = e2.name) > 1;

To delete duplicates while keeping one:

DELETE FROM employees

WHERE employee_id NOT IN (


SELECT MIN(employee_id)

FROM employees

GROUP BY name

);

3. What is a window function in SQL? Give examples.


Window functions perform calculations across a set of table rows related to the
current row without collapsing the results.

1. ROW_NUMBER()

Assigns a unique row number within a partition.

SELECT employee_id, name, department_id,
ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees;

2. RANK()

Gives ranking with gaps for ties.

SELECT employee_id, name, department_id,

RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank

FROM employees;

3. DENSE_RANK()

Same as RANK() but without gaps.

SELECT employee_id, name, department_id,
DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees;

4. LAG() & LEAD()

Used to access previous or next row values.


SELECT employee_id, name, salary,

LAG(salary) OVER (ORDER BY salary DESC) AS prev_salary,

LEAD(salary) OVER (ORDER BY salary DESC) AS next_salary

FROM employees;

4. How do you optimize a slow SQL query?


Optimizing SQL queries involves:

1. Use Indexing – Index columns used in WHERE, JOIN, and ORDER BY.

CREATE INDEX idx_employee_id ON employees(employee_id);

2. Avoid SELECT * – Fetch only required columns.

SELECT employee_id, name FROM employees;

3. Use EXISTS instead of IN for subqueries

SELECT employee_id FROM employees WHERE EXISTS (
    SELECT 1 FROM departments WHERE employees.department_id = departments.department_id
);

4. Use Joins Efficiently – Prefer INNER JOIN over OUTER JOIN when possible.

5. Use Query Execution Plan – Use EXPLAIN in MySQL or EXPLAIN ANALYZE in PostgreSQL.

EXPLAIN SELECT * FROM employees WHERE salary > 50000;

6. Partition Large Tables – If working with billions of records, partitioning helps.

CREATE TABLE employees_partitioned (

employee_id INT NOT NULL,

name VARCHAR(100),

salary INT,
department_id INT

) PARTITION BY RANGE (salary) (

PARTITION p1 VALUES LESS THAN (30000),

PARTITION p2 VALUES LESS THAN (60000),

PARTITION p3 VALUES LESS THAN (100000)

);

5. Explain the difference between a CTE and a subquery.


Comparison – CTE (Common Table Expression) vs. Subquery:

• Readability – CTE: more readable and reusable; Subquery: harder to read.
• Performance – CTE: optimized for repeated use; Subquery: may execute multiple times.
• Recursion – CTE: supports recursion; Subquery: no recursion support.
• Scope – CTE: exists only within the query where it is defined; Subquery: limited to its parent query.

CTE Example

WITH TopEmployees AS (
    SELECT employee_id, name, salary FROM employees WHERE salary > 50000
)
SELECT * FROM TopEmployees;

Subquery Example

SELECT employee_id, name, salary

FROM employees

WHERE salary > (SELECT AVG(salary) FROM employees);

6. How would you design a normalized database schema?


Normalization helps reduce redundancy and improve integrity.
Normalization Steps

1. First Normal Form (1NF):

o Ensure atomicity (no repeating groups).

2. Second Normal Form (2NF):

o Remove partial dependencies (every non-key attribute must depend on the entire primary key).

3. Third Normal Form (3NF):

o Remove transitive dependencies (non-key attributes should depend only on the primary key, not on other non-key attributes).

Example Schema

Unnormalized Table:

OrderID CustomerName ProductName Price

1 Alice Laptop 1000

2 Bob Mouse 20

Normalized Schema (3NF)

1. Customers Table

CREATE TABLE Customers (

CustomerID INT PRIMARY KEY,

Name VARCHAR(100)

);

2. Orders Table

CREATE TABLE Orders (

OrderID INT PRIMARY KEY,

CustomerID INT,

FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)

);
3. Products Table

CREATE TABLE Products (

ProductID INT PRIMARY KEY,

ProductName VARCHAR(100),

Price DECIMAL(10,2)

);

7. What is the use of indexing in SQL?


Indexes improve query performance by reducing the amount of data scanned.

Types of Indexes

1. Primary Index (Clustered Index)

CREATE TABLE employees (

employee_id INT PRIMARY KEY,

name VARCHAR(100)

);

2. Unique Index

CREATE UNIQUE INDEX idx_employee_name ON employees(name);

3. Composite Index

CREATE INDEX idx_name_dept ON employees(name, department_id);

4. Full-Text Index (for searching text)

CREATE FULLTEXT INDEX idx_search ON employees(name);

Best Practices:

• Index columns used in WHERE, JOIN, and ORDER BY.

• Avoid too many indexes (affects inserts/updates).

• Use EXPLAIN to check index usage.


Python for Data Analysis
8. How do you handle missing values in a dataset using Pandas?
Missing values can cause issues in data analysis, so handling them properly is crucial.
Pandas provides multiple ways to deal with missing values.

Checking for Missing Values

import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [None, 2, 3, 4]})

print(df.isnull()) # Returns True for missing values

print(df.isnull().sum()) # Count of missing values per column

Removing Missing Values

• Remove rows with missing values:

df.dropna(inplace=True)

• Remove columns with missing values:

df.dropna(axis=1, inplace=True)

Filling Missing Values

• Fill with a specific value:

df.fillna(0, inplace=True)

• Fill with column mean/median/mode:

df['A'].fillna(df['A'].mean(), inplace=True)

df['B'].fillna(df['B'].median(), inplace=True)

df['B'].fillna(df['B'].mode()[0], inplace=True)

• Forward fill (propagate last valid value forward):

df.fillna(method='ffill', inplace=True)

• Backward fill (propagate next valid value backward):


df.fillna(method='bfill', inplace=True)

9. What is the difference between a list and a tuple in Python?


Feature List Tuple

Mutability Mutable (can be changed) Immutable (cannot be changed)

Syntax list = [1, 2, 3] tuple = (1, 2, 3)

Performance Slower (since it can be modified) Faster (fixed memory allocation)

Memory Usage Uses more memory Uses less memory

Use Case When data needs to change When data should remain constant

Example

# List (mutable)

my_list = [1, 2, 3]

my_list.append(4) # Allowed

print(my_list) # [1, 2, 3, 4]

# Tuple (immutable)

my_tuple = (1, 2, 3)

# my_tuple.append(4) # Raises an AttributeError

10. How do you remove duplicate rows from a DataFrame?


To remove duplicate rows from a DataFrame, use drop_duplicates().

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David'],
        'Age': [25, 30, 25, 40]}

df = pd.DataFrame(data)

# Remove duplicate rows

df.drop_duplicates(inplace=True)

print(df)

Removing Duplicates Based on Specific Columns

df.drop_duplicates(subset=['Name'], keep='first', inplace=True)  # Keeps first occurrence
df.drop_duplicates(subset=['Name'], keep='last', inplace=True)   # Keeps last occurrence
df.drop_duplicates(subset=['Name'], keep=False, inplace=True)    # Removes all duplicates

11. Explain the difference between apply() and map() functions in Pandas.

• apply() – Works on both DataFrames and Series; applies a function element-wise or row/column-wise. Example: df.apply(func, axis=1)
• map() – Works on Series only; applies a function element-wise. Example: df['column'].map(func)

Example of apply()

Used when applying functions to rows or columns.

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

# Apply function to each element


df['A_squared'] = df['A'].apply(lambda x: x**2)

print(df)

Example of map()

Used for transforming values in a Series.


df['B'] = df['B'].map(lambda x: x / 10)

print(df)

12. How do you perform group-by operations in Python?


The groupby() function in Pandas is used to group data and apply aggregate functions.

import pandas as pd

data = {'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],

'Salary': [50000, 60000, 55000, 65000, 70000]}

df = pd.DataFrame(data)

# Group by Department and calculate average salary

grouped_df = df.groupby('Department')['Salary'].mean()

print(grouped_df)

Other Aggregate Functions

• Count: df.groupby('Department')['Salary'].count()

• Sum: df.groupby('Department')['Salary'].sum()

• Max: df.groupby('Department')['Salary'].max()

• Min: df.groupby('Department')['Salary'].min()
Applying Multiple Aggregations

df.groupby('Department')['Salary'].agg(['mean', 'sum', 'max'])

13. How do you automate a data cleaning task using Python?


To automate data cleaning, write a function that applies standard cleaning techniques
and use it across datasets.

Example: Automating Data Cleaning

import pandas as pd

def clean_data(df):
    # Drop duplicates
    df.drop_duplicates(inplace=True)
    # Fill missing values with column mean
    df.fillna(df.mean(numeric_only=True), inplace=True)
    # Standardize column names
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
    return df

# Example usage

data = {'Name': ['Alice', 'Bob', 'Alice', None], 'Age': [25, 30, None, 40]}

df = pd.DataFrame(data)

cleaned_df = clean_data(df)
print(cleaned_df)

Automation Using a Script

Save the function in a script (data_cleaning.py) and reuse it.

import pandas as pd

from data_cleaning import clean_data

df = pd.read_csv('dirty_data.csv')

cleaned_df = clean_data(df)

cleaned_df.to_csv('cleaned_data.csv', index=False)

14. Explain lambda functions and their use in data analysis.


Lambda functions are small, anonymous functions that can be defined in a single line.

Syntax

lambda arguments: expression

Example: Using Lambda in Pandas

import pandas as pd

df = pd.DataFrame({'Numbers': [1, 2, 3, 4]})

# Apply lambda function to square numbers

df['Squared'] = df['Numbers'].apply(lambda x: x**2)

print(df)

Using Lambda in Sorting

data = [('Alice', 25), ('Bob', 30), ('Charlie', 20)]

data.sort(key=lambda x: x[1])

print(data) # Sorted by age


Using Lambda in Filtering

filtered_data = list(filter(lambda x: x[1] > 25, data))

print(filtered_data)

Why Use Lambda?

• Short and concise function definition.

• Useful for one-time operations in apply(), map(), filter(), and sort().

Data Visualization & BI Tools


15. How do you create a dashboard in Power BI?
Creating a dashboard in Power BI involves several steps:

Step 1: Load Data into Power BI

• Open Power BI Desktop.

• Click Home > Get Data and import data (Excel, SQL, CSV, etc.).

• Click Transform Data (Power Query) to clean and shape data.

Step 2: Create Data Model

• Define relationships between tables using the Model View.

• Create calculated columns, measures, or tables using DAX.

Step 3: Build Visualizations

• Use the Report View to add charts, tables, KPIs, and slicers.

• Common visuals: Bar charts, Line charts, Pie charts, Cards, Maps.

Step 4: Use DAX for Custom Calculations

• Example DAX measure to calculate Total Sales:

TotalSales = SUM(Sales[Amount])

Step 5: Create a Dashboard (in Power BI Service)


• Publish the report to Power BI Service.

• Go to Dashboards, click "Pin Visuals" to add key insights.

• Arrange and customize the layout.

Step 6: Enable Interactivity & Filters

• Add slicers (dropdowns, date filters).

• Use Drill-through pages to navigate deeper insights.

Step 7: Share & Schedule Refresh

• Share the dashboard with stakeholders.

• Set up scheduled refresh for automatic updates.

16. What is the difference between Power BI and Tableau?


• Ease of Use – Power BI: easier for beginners, integrated with the Microsoft ecosystem; Tableau: more complex but highly customizable.
• Cost – Power BI: more affordable (Power BI Pro ~$10/month); Tableau: more expensive (Tableau Creator ~$70/month).
• Data Connectivity – Power BI: seamless with Microsoft tools (Excel, Azure, SQL); Tableau: more external connectors.
• Performance – Power BI: efficient for small to mid-sized datasets; Tableau: better performance for large datasets.
• Calculations – Power BI: uses DAX; Tableau: uses Tableau Calculations & LOD Expressions.
• Deployment – Power BI: cloud-based (Power BI Service) & on-premises; Tableau: on-premises (Tableau Server) & cloud (Tableau Online).
• Visualization Options – Power BI: good, but limited customization; Tableau: more flexibility & aesthetics.
Power BI is great for Microsoft-based companies & affordability.
Tableau is preferred for advanced data visualization & interactivity.

17. Explain DAX functions and their use in Power BI.


DAX (Data Analysis Expressions) is a formula language used in Power BI to create
calculated columns, measures, and tables.

Common Types of DAX Functions

Aggregate Functions

TotalSales = SUM(Sales[Amount])

AverageSales = AVERAGE(Sales[Amount])

Filter Functions

FilteredSales = CALCULATE(SUM(Sales[Amount]), Sales[Region] = "East")

Time Intelligence Functions

SalesLastYear = CALCULATE(SUM(Sales[Amount]), SAMEPERIODLASTYEAR(Sales[Date]))

Logical Functions

SalesCategory = IF(Sales[Amount] > 5000, "High", "Low")

Ranking Functions

SalesRank = RANKX(ALL(Sales[Product]), SUM(Sales[Amount]), , DESC, DENSE)

DAX is essential for advanced calculations, custom measures, and enhanced data
analysis in Power BI.

18. How do you handle real-time data updates in Power BI?


Power BI allows real-time data streaming for live dashboards.

Methods to Handle Real-Time Data

1. DirectQuery Mode
• Connect to live data sources (SQL Server, Snowflake, Azure).

• No data import; queries are run live.

• Suitable for real-time dashboards but may affect performance.

2. Streaming Dataset (Power BI Service)

• Push data from external sources via Power BI REST API.

• Use for IoT, sensor data, or real-time logs.

3. Auto Page Refresh (For DirectQuery)

• Go to Report Page > Format Pane > Page Refresh.

• Set refresh interval (e.g., every 5 seconds).

4. Scheduled Refresh for Imported Data

• In Power BI Service, go to Dataset Settings > Schedule Refresh.

• Use Gateways for on-premise SQL data sources.

Best approach depends on whether the data source supports live connections.

19. What are the best practices for report performance optimization in Power BI?

Best practices to optimize Power BI performance:

1. Use Aggregations & Reduce Data Volume

• Avoid importing unnecessary columns & rows.

• Use SUMMARIZE() or GROUP BY to pre-aggregate data.

2. Optimize DAX Queries

• Prefer SUMX() over SUM() when working with row context.

• Avoid using CALCULATE() unnecessarily.

• Use variables to store intermediate calculations.

3. Optimize Relationships & Model Design


• Use Star Schema (fact and dimension tables) instead of Snowflake Schema.

• Reduce bidirectional relationships.

4. Enable Query Reduction Settings

• In Options > Query Reduction, enable "Reduce the number of queries sent to
the server".

5. Optimize Visuals

• Limit the number of visuals per page.

• Disable unnecessary interactivity & cross-filtering.

6. Use Incremental Refresh for Large Datasets

• Power BI Premium/Pro allows incremental data refresh, avoiding a full dataset refresh.

A combination of better data modeling, optimized DAX, and efficient data storage
significantly improves Power BI performance.

20. How do you implement row-level security (RLS) in Power BI?


Row-Level Security (RLS) restricts data visibility based on user roles.

Steps to Implement RLS:

Step 1: Define Roles in Power BI Desktop

1. Go to Modeling > Manage Roles.

2. Click Create Role and define a filter using DAX.

[Region] = "East" -- Users with this role will only see East region data

3. Save and assign roles.

Step 2: Assign Users in Power BI Service

1. Publish the report to Power BI Service.

2. Go to Dataset > Security.

3. Assign users or security groups to the created roles.


Step 3: Test RLS

• Use View as Role in Power BI Desktop to validate security rules.

Dynamic RLS (User-Based Filtering)

Instead of static values, use USERNAME() or USERPRINCIPALNAME() to filter data dynamically.

[Email] = USERPRINCIPALNAME()

Dynamic RLS ensures that each user only sees their permitted data.

Final Thoughts

Power BI is powerful for creating interactive dashboards, managing large datasets, and implementing security controls.
DAX functions help in advanced calculations and time intelligence analysis.
Best practices like optimizing models, reducing visuals, and using DirectQuery
improve performance.
Real-time data updates require streaming datasets, auto-refresh, or DirectQuery.

Statistical & Analytical Thinking

21. What are the key differences between descriptive and inferential statistics?

• Definition – Descriptive: summarizes and organizes data; Inferential: draws conclusions from a sample to a population.
• Purpose – Descriptive: provides insights into dataset characteristics; Inferential: makes predictions or inferences about a larger population.
• Techniques – Descriptive: measures of central tendency (mean, median, mode), dispersion (variance, standard deviation), and visualization (histograms, box plots); Inferential: hypothesis testing, confidence intervals, regression analysis.
• Example – Descriptive: "The average height of students in a class is 5.6 ft."; Inferential: "Based on a sample, the average height of all students in a school is likely 5.6 ft ± 0.2 ft."

Descriptive statistics focus on summarizing data, while inferential statistics generalize findings beyond the sample data.
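
To make the contrast concrete, here is a small Python sketch; the height values are made up for illustration. The summary statistics describe the sample itself, while the confidence interval makes an inference about the wider population.

import numpy as np
from scipy import stats

# Hypothetical sample of student heights in feet (illustrative values only)
heights = np.array([5.4, 5.6, 5.7, 5.5, 5.8, 5.6, 5.9, 5.3])

# Descriptive statistics: summarize the observed sample
print("Mean:", heights.mean())
print("Median:", np.median(heights))
print("Std dev:", heights.std(ddof=1))

# Inferential statistics: 95% confidence interval for the population mean
ci = stats.t.interval(0.95, len(heights) - 1, loc=heights.mean(), scale=stats.sem(heights))
print("95% CI for population mean:", ci)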

22. How do you detect and handle outliers in a dataset?


Step 1: Detecting Outliers

Using Boxplot (IQR Method)

• Outliers are values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.

• Python Example:

import pandas as pd

import numpy as np

data = pd.DataFrame({'values': [10, 12, 13, 15, 200, 17, 19, 21]})

Q1 = data['values'].quantile(0.25)

Q3 = data['values'].quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR

upper_bound = Q3 + 1.5 * IQR

outliers = data[(data['values'] < lower_bound) | (data['values'] > upper_bound)]


print(outliers)

Using Z-Score Method

• Any data point with a Z-score > 3 or < -3 is an outlier.

• Python Example:

from scipy import stats

data['zscore'] = np.abs(stats.zscore(data['values']))

outliers = data[data['zscore'] > 3]

print(outliers)

Step 2: Handling Outliers

Option 1: Remove Outliers (if data is erroneous or irrelevant).

data_cleaned = data[(data['values'] >= lower_bound) & (data['values'] <= upper_bound)]

Option 2: Transform Data (Log/Square Root Transformation).

data['log_values'] = np.log(data['values'])

Option 3: Replace Outliers with Median or Mean.

data['values'] = np.where(data['values'] > upper_bound, data['values'].median(), data['values'])

Choosing a method depends on whether the outlier is a genuine extreme value or an error.

23. What is correlation? How do you measure it?


Correlation measures the strength and direction of the relationship between two
variables.

Types of Correlation

• Positive Correlation (r > 0): As X increases, Y increases (e.g., more study hours →
higher grades).
• Negative Correlation (r < 0): As X increases, Y decreases (e.g., more time on
social media → lower productivity).

• No Correlation (r ≈ 0): No relationship between variables.

Methods to Measure Correlation

Pearson Correlation (for linear relationships)

• Python Example:

import pandas as pd

df = pd.DataFrame({'X': [10, 20, 30, 40], 'Y': [15, 25, 35, 50]})

correlation = df['X'].corr(df['Y'])

print("Pearson Correlation:", correlation)

Spearman Rank Correlation (for non-linear data)

spearman_corr = df.corr(method='spearman')

Kendall’s Tau (for ordinal data)

kendall_corr = df.corr(method='kendall')

Pearson's r is most common, but Spearman/Kendall are better for non-linear relationships.

24. Explain hypothesis testing and its significance in data analysis.

Hypothesis Testing is a statistical method to determine if a sample result is
significant or just due to random chance.

Steps in Hypothesis Testing:

1. State Null & Alternative Hypothesis


• Null Hypothesis (H₀): No effect or difference.

• Alternative Hypothesis (H₁): There is an effect or difference.

• Example: Does a new marketing strategy increase sales?

o H₀: The new strategy has no impact on sales.

o H₁: The new strategy increases sales.

2. Select Significance Level (α)

• Common values: 0.05 (5%) or 0.01 (1%).

• α = 0.05 → 5% probability of wrongly rejecting H₀.

3. Choose a Statistical Test

• t-test: Comparing means of two groups.

• Chi-Square test: Categorical variables association.

• ANOVA: Comparing more than two group means.

4. Calculate p-value

• If p < α, reject H₀ (significant).

• If p > α, fail to reject H₀ (not significant).

Example (t-test in Python):

from scipy.stats import ttest_ind

group1 = [100, 120, 130, 140, 150]

group2 = [90, 95, 110, 115, 120]

t_stat, p_value = ttest_ind(group1, group2)

print("p-value:", p_value)

If p < 0.05, the two groups are significantly different.
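
For the other tests listed in step 3, scipy offers similar helpers. A hedged sketch with made-up data (the contingency table and regional sales figures below are illustrative assumptions):

import numpy as np
from scipy.stats import chi2_contingency, f_oneway

# Chi-Square test: association between two categorical variables
# (hypothetical 2x2 table, e.g., customer segment vs. product preference)
observed = np.array([[30, 10], [20, 40]])
chi2, p_chi, dof, expected = chi2_contingency(observed)
print("Chi-square p-value:", p_chi)

# One-way ANOVA: compare means of more than two groups (hypothetical sales by region)
north, south, east = [100, 110, 105, 120], [90, 95, 100, 85], [130, 125, 140, 135]
f_stat, p_anova = f_oneway(north, south, east)
print("ANOVA p-value:", p_anova)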

Significance

• Used in A/B testing, medical trials, market research.

• Helps make data-driven decisions instead of assumptions.


25. How would you analyze sales data to find patterns?
Step 1: Data Understanding

• Explore data columns (sales amount, region, product category, time, etc.).

• Handle missing values and ensure consistency.

Step 2: Exploratory Data Analysis (EDA)

Trend Analysis (Find seasonal sales patterns)

import matplotlib.pyplot as plt

df['Month'] = pd.to_datetime(df['Date']).dt.month

df.groupby('Month')['Sales'].sum().plot(kind='line', marker='o')

plt.show()

Helps identify peak sales months.

Product-wise Sales Contribution

df.groupby('Product')['Sales'].sum().plot(kind='bar')

Find best-selling products.

Region-wise Sales Comparison

import seaborn as sns

sns.boxplot(x='Region', y='Sales', data=df)

Identify high-performing regions.

Step 3: Identify Correlations

• Does advertising spend impact sales?

df[['Sales', 'Ad_Spend']].corr()

Step 4: Predict Future Sales (Time Series Forecasting)

from statsmodels.tsa.holtwinters import ExponentialSmoothing

model = ExponentialSmoothing(df['Sales']).fit()
df['Forecast'] = model.predict(start=0, end=len(df) - 1)

Insights from analysis can guide promotions, inventory management, and business strategy.

Business & Scenario-Based Questions

26. Suppose you're given raw customer transaction data. How would you analyze it?

Step 1: Understand the Data

• Identify available columns (e.g., Customer_ID, Transaction_Date, Product_Category, Amount_Spent).

• Check for missing values, duplicates, and data inconsistencies.

Step 2: Clean & Prepare the Data

• Handle missing values (fillna() for missing data, drop_duplicates() for duplicate
transactions).

• Convert date columns to datetime format for time-based analysis.

Step 3: Exploratory Data Analysis (EDA)


1. Sales Trends Over Time

df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])

df.set_index('Transaction_Date')['Amount_Spent'].resample('M').sum().plot(kind='line')

Identifies seasonal trends and peak transaction months.

2. Customer Segmentation (RFM Analysis)

• Recency (Days since last purchase)

• Frequency (Number of transactions)

• Monetary Value (Total spending)

import datetime as dt

today = dt.datetime.today()

# Aggregate to one row per customer
rfm = df.groupby('Customer_ID').agg(
    Recency=('Transaction_Date', lambda x: (today - x.max()).days),
    Frequency=('Transaction_Date', 'count'),
    Monetary=('Amount_Spent', 'sum')
)

3. Product Performance

df.groupby('Product_Category')['Amount_Spent'].sum().plot(kind='bar')

Helps determine the best-selling categories.

Step 4: Business Insights

• Identify high-value customers using RFM.

• Find low-performing products to optimize inventory.

• Suggest personalized promotions based on purchase patterns.

27. How do you present insights from data to non-technical stakeholders?

1. Know Your Audience

• C-suite executives → Focus on high-level KPIs.

• Marketing → Show customer behavior insights.

• Operations → Highlight efficiency metrics.

2. Use Visual Storytelling

• Power BI/Tableau Dashboards for trends.

• Graphs & Charts instead of raw numbers.

• Comparisons (e.g., “Sales increased by 20% YoY”).

3. Keep It Simple & Actionable

• Avoid technical jargon (e.g., say "sales increased" instead of "revenue has an
upward linear regression trend").
• Use bullet points & structured slides.

4. Focus on Business Impact

• Instead of “Customer churn rate is 10%,” say “We lost 1,000 customers last
month, costing $50K in revenue”.

• Suggest next steps: “A loyalty program can reduce churn by 15%.”

5. Use Power BI for Interactive Reports

# Sample DAX for Dynamic Sales KPI

Sales_Growth = DIVIDE([Total Sales] - [Previous Sales], [Previous Sales])

The goal is to drive decisions, not just share data.

28. What key metrics would you track for an e-commerce company?

1. Sales & Revenue Metrics

• Total Revenue → SUM(Sales_Amount)

• Average Order Value (AOV) → Revenue / Total Orders

• Sales Conversion Rate → Orders / Website Visitors

2. Customer Behavior Metrics

• Customer Retention Rate → Returning Customers / Total Customers

• Customer Lifetime Value (CLV) → (AOV × Purchase Frequency) × Customer Lifespan

• Churn Rate → Lost Customers / Total Customers

3. Marketing Performance Metrics

• Customer Acquisition Cost (CAC) → Total Marketing Spend / New Customers

• Return on Ad Spend (ROAS) → Revenue from Ads / Ad Spend

4. Operational Metrics

• Cart Abandonment Rate → Abandoned Carts / Initiated Checkouts


• Fulfillment Time → time from order placed to order delivered

These metrics help optimize pricing, marketing, and inventory management.
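
A rough pandas sketch of how a few of these metrics could be computed from transaction data; the orders table, visitor count, and marketing spend below are made-up placeholders:

import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'customer_id': ['A', 'B', 'A', 'C', 'B'],
    'amount': [120, 80, 200, 50, 150]
})
website_visitors = 1000   # assumed figure
marketing_spend = 500     # assumed figure

total_revenue = orders['amount'].sum()
aov = total_revenue / orders['order_id'].nunique()                 # Average Order Value
conversion_rate = orders['order_id'].nunique() / website_visitors  # Sales Conversion Rate
cac = marketing_spend / orders['customer_id'].nunique()            # CAC (treating all customers as new here)
repeat_rate = (orders['customer_id'].value_counts() > 1).mean()    # share of repeat customers

print(total_revenue, round(aov, 2), f"{conversion_rate:.1%}", round(cac, 2), f"{repeat_rate:.1%}")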

29. How would you automate a weekly sales report in Power BI?
1. Connect to Data Source

• Use SQL Server, Excel, API, or Google Sheets.

• Schedule a data refresh in Power BI Service.

2. Create Key Sales Metrics

• Total Weekly Sales

Weekly_Sales = SUMX(FILTER(Sales, Sales[Date] >= TODAY() - 7), Sales[Amount])

• Sales Growth Rate

Sales_Growth = DIVIDE([This Week Sales] - [Last Week Sales], [Last Week Sales])

3. Design a Power BI Dashboard

• Sales Trend (Line Chart)

• Top Selling Products (Bar Chart)

• Region-Wise Sales (Map Visual)

4. Automate Report Delivery

• Publish the report to Power BI Service.

• Set up a Power BI Data Refresh Schedule (e.g., every Monday 8 AM).

• Use Power BI Subscriptions to email reports to stakeholders.

This ensures timely reporting without manual intervention.

30. What steps would you take to validate data accuracy before creating a report?
1. Check for Missing & Duplicate Data
df.isnull().sum() # Check for missing values

df.duplicated().sum() # Check for duplicates

2. Ensure Data Consistency

• Standardize categories (upper(), strip()).

• Verify date formats.
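
A minimal sketch of these consistency checks (the Region and Date column names are hypothetical):

import pandas as pd

df['Region'] = df['Region'].str.strip().str.upper()        # standardize category values
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')   # enforce a single date format
print(df['Date'].isnull().sum())                           # count of dates that failed to parse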

3. Cross-Check Data with Source

• Compare Power BI report totals with SQL queries:

SELECT SUM(Sales_Amount) FROM Sales_Table WHERE Date BETWEEN '2024-01-01' AND '2024-01-07';

• Validate sample records manually.

4. Look for Anomalies & Outliers

df.describe() # Check for unusually high/low values

5. Conduct Reconciliation Testing

• Compare Power BI visuals with backend data.

• Check for negative or unrealistic sales figures.

6. Get Business Validation

• Ask finance/sales teams to confirm key numbers before publishing.

Data validation prevents incorrect insights that could lead to poor business
decisions.
