PDS Question Bank
Introduction: Need for data science - Benefits and uses - Causality and Experimentation - Facets
of data - Data science process: Retrieving data - Cleansing, integrating and transforming data -
Exploratory Data Analysis - Build the models - Presenting findings and building applications
1. Retrieving data
2. Cleansing, integrating, and transforming data
3. Exploratory Data Analysis (EDA)
4. Building models
5. Presenting findings and building applications
1. Explain the need for data science and its benefits and uses.
o Answer:
1. Need for Data Science:
The exponential growth of data requires tools to analyze and
derive insights.
Traditional methods are insufficient for handling large-scale,
complex datasets.
Data science provides techniques to predict outcomes and
improve decision-making.
2. Benefits:
Drives business decisions through data-driven insights.
Enhances operational efficiency and cost savings.
Provides predictive and prescriptive analytics for future
planning.
3. Uses:
In healthcare: Predicting diseases and optimizing treatments.
In finance: Fraud detection and credit risk analysis.
In retail: Personalized recommendations and inventory
management.
6. Describe Exploratory Data Analysis (EDA) and its role in the data science
process.
o Answer:
1. Definition:
EDA is the process of summarizing and visualizing data to
identify patterns, relationships, and anomalies.
2. Steps in EDA:
Visualizing distributions using histograms or density plots.
Checking relationships between variables using scatter plots.
Identifying missing values or outliers.
3. Role in Data Science:
Provides insights to guide feature engineering.
Detects potential issues in data quality.
4. Example:
Analyzing customer purchase data to understand buying
patterns.
5. Tools:
Libraries like pandas, matplotlib, and seaborn.
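A first EDA pass on the customer purchase example could be sketched as follows. The DataFrame, its column names (customer, category, amount), and all values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical customer purchase data (illustrative values only)
purchases = pd.DataFrame({
    "customer": ["C1", "C1", "C2", "C2", "C3", "C3"],
    "category": ["grocery", "toys", "grocery", "grocery", "toys", "grocery"],
    "amount": [20.0, 35.0, 15.0, None, 500.0, 25.0],
})

# Summary statistics for the numeric column
summary = purchases["amount"].describe()

# Count missing values per column (a data quality check)
missing = purchases.isna().sum()

# Buying pattern: total spend per category (missing amounts are skipped)
spend_by_category = purchases.groupby("category")["amount"].sum()

print(summary)
print(missing)
print(spend_by_category)
```

The unusually large toys purchase stands out in the summary, which is the kind of anomaly EDA is meant to surface before modeling.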
7. Explain the importance of presenting findings and building applications in data
science.
o Answer:
1. Presenting Findings:
Converts complex data insights into actionable visuals like
dashboards, reports, and graphs.
Example: A sales dashboard showing trends and forecasts.
2. Building Applications:
Embeds data science models into software tools for automation
and scalability.
Example: A recommendation system for e-commerce websites.
3. Importance:
Improves decision-making for stakeholders.
Enhances user experience with real-time insights.
Bridges the gap between data analysis and business application.
3. Explain the key pandas data structures (Series and DataFrame) with examples.
o Answer:
1. Series:
One-dimensional labeled array (with pandas imported as pd):
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
2. DataFrame:
Two-dimensional labeled data structure:
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
3. Key Operations:
Indexing: Access rows or columns using labels.
Selection: Use loc[] for label-based selection and iloc[] for
position-based selection.
4. Significance:
Ideal for structured data.
Simplifies data manipulation.
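The indexing and selection operations above can be tried with a short sketch. The row labels "r1" and "r2" are an assumption added here so that loc[] and iloc[] behave visibly differently; they are not part of the original example.

```python
import pandas as pd

# One-dimensional labeled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Two-dimensional labeled structure; the row labels are hypothetical
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r1", "r2"])

by_label = s["b"]              # label-based indexing on a Series
col_a = df["A"]                # select a column by its label
cell_loc = df.loc["r2", "B"]   # label-based: row "r2", column "B"
cell_iloc = df.iloc[1, 1]      # position-based: second row, second column

print(by_label, cell_loc, cell_iloc)
```

Both loc[] and iloc[] reach the same cell here; they differ only in whether the row and column are named by label or by integer position.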
5. What is an outlier?
o Answer: An outlier is a data point that deviates significantly from the rest of
the dataset and can affect analysis results.
1. Explain the steps involved in data cleaning, including handling missing data and
removing duplicates.
o Answer:
1. Definition: Data cleaning involves detecting and correcting errors in
datasets to ensure reliability.
2. Steps:
Handling Missing Data:
Drop rows/columns with missing values using dropna().
Fill missing values using fillna() with:
Mean or median for numerical data.
Mode or a specific placeholder value for categorical data.
Removing Duplicates:
Detect duplicates using duplicated().
Remove duplicates using drop_duplicates().
3. Importance:
Reduces noise.
Ensures consistency in analysis.
Improves model accuracy.
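The cleaning steps can be sketched on a small table. The data, the column names (age, city), and the "Unknown" placeholder are hypothetical; whether to drop or fill depends on the dataset.

```python
import pandas as pd

# Hypothetical raw data with one duplicate row and two missing values
raw = pd.DataFrame({
    "age": [25.0, None, 40.0, 40.0, 31.0],
    "city": ["Pune", "Delhi", "Mumbai", "Mumbai", None],
})

# Detect and remove duplicates
dup_mask = raw.duplicated()        # True for rows repeating an earlier row
deduped = raw.drop_duplicates()

# Option 1: drop every row that has a missing value
dropped = deduped.dropna()

# Option 2: fill instead of dropping
# (median for the numeric column, a placeholder for the categorical one)
filled = deduped.fillna({
    "age": deduped["age"].median(),
    "city": "Unknown",
})

print(dup_mask.sum(), len(dropped))
print(filled)
```

Dropping is simpler but loses rows; filling keeps every row at the cost of introducing estimated values.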
2. Describe data transformation techniques with examples of removing duplicates,
mapping, replacing values, and filtering outliers.
o Answer:
1. Removing Duplicates:
Example: df.drop_duplicates().
2. Transforming Data Using a Function or Mapping:
Example: Map categorical values to numbers using
df['column'].map({'A': 1, 'B': 2}).
3. Replacing Values:
Example: Replace incorrect entries using
df.replace({'old_value': 'new_value'}).
4. Filtering Outliers:
Use statistical methods like:
IQR Method: Filter out values outside the range
Q1 - 1.5 × IQR to Q3 + 1.5 × IQR.
Z-Score: Remove values whose absolute z-score exceeds 3.
5. Significance:
Prepares data for accurate analysis.
Reduces bias and anomalies.
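A minimal sketch of mapping, replacing, and both outlier filters follows. The data and the column names (grade, score) are hypothetical, and the cutoffs of 1.5 × IQR and 3 standard deviations are the conventional thresholds, not fixed rules.

```python
import pandas as pd

# Hypothetical data; the score 95 is an obvious outlier
df = pd.DataFrame({
    "grade": ["A", "B"] * 6,
    "score": [10, 12, 11, 13, 12, 11, 10, 13, 12, 11, 12, 95],
})

# Mapping: convert categorical labels to numbers
df["grade_num"] = df["grade"].map({"A": 1, "B": 2})

# Replacing values: swap codes for readable labels
df["grade_label"] = df["grade"].replace({"A": "Excellent", "B": "Good"})

# IQR method: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df["score"].quantile(0.25)
q3 = df["score"].quantile(0.75)
iqr = q3 - q1
iqr_filtered = df[df["score"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Z-score method: keep values within 3 standard deviations of the mean
z = (df["score"] - df["score"].mean()) / df["score"].std()
z_filtered = df[z.abs() <= 3]

print(len(df), len(iqr_filtered), len(z_filtered))
```

The IQR method relies on quartiles, so it is less affected by the outlier itself than the z-score, whose mean and standard deviation are both pulled toward extreme values.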
4. Describe the various types of plots in pandas and matplotlib with examples.
o Answer:
1. Line Plot:
Used for visualizing trends over time.
Example:
df.plot.line(x='date', y='value')
2. Bar Plot:
Visualizes categorical data.
Example:
df['category'].value_counts().plot.bar()
3. Histogram:
Displays data distribution.
Example:
df['column'].plot.hist(bins=10)
4. Density Plot:
Shows a smoothed (kernel density) estimate of the probability
density function.
Example:
df['column'].plot.kde()
5. Scatter Plot:
Visualizes relationships between two variables.
Example:
df.plot.scatter(x='x_column', y='y_column')
6. Significance:
Helps in data exploration.
Identifies patterns, trends, and relationships.
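Several of these plots can be drawn side by side in one figure, as in the sketch below. The data and column names are hypothetical, and the "Agg" backend is an assumption made so the script runs without a display. The density plot is left out because plot.kde() additionally requires SciPy.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "value": [3, 5, 4, 6, 8, 7],
    "other": [1, 2, 2, 3, 5, 4],
    "category": ["x", "y", "x", "z", "x", "y"],
})

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Line plot: trend over time
df.plot.line(x="date", y="value", ax=axes[0, 0], title="Line")

# Bar plot: count of rows per category
df["category"].value_counts().plot.bar(ax=axes[0, 1], title="Bar")

# Histogram: distribution of a numeric column
df["value"].plot.hist(bins=5, ax=axes[1, 0], title="Histogram")

# Scatter plot: relationship between two numeric columns
df.plot.scatter(x="value", y="other", ax=axes[1, 1], title="Scatter")

fig.tight_layout()
# fig.savefig("plots.png") would write the figure to disk
```

Passing ax= to each call places the plot in a chosen subplot instead of opening a new figure.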
5. What is skewness?
o Answer: Skewness measures the asymmetry of a data distribution. A
distribution can be positively skewed (right-skewed), negatively skewed (left-
skewed), or symmetric.
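The sign of the skewness can be checked with pandas, as in this small sketch. The three series are made-up examples, one for each case.

```python
import pandas as pd

# Made-up samples, one per case
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])         # long right tail
left_skewed = pd.Series([-20, -4, -3, -3, -3, -2, -2, -1])  # long left tail
symmetric = pd.Series([1, 2, 3, 4, 5])

print(right_skewed.skew())  # positive value
print(left_skewed.skew())   # negative value
print(symmetric.skew())     # approximately zero
```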
2. What is Exploratory Data Analysis (EDA)? Explain the techniques used in EDA.
o Answer:
1. Definition: EDA involves analyzing datasets to summarize their main
characteristics using visual and statistical methods.
2. Techniques:
Data Summarization: Calculate measures like mean, median,
and standard deviation.
Data Distribution: Use histograms, boxplots, and density plots
to understand spread.
Outlier Detection: Identify outliers using boxplots or z-scores.
Measuring Asymmetry: Compute skewness and kurtosis.
Visualizations: Use scatter plots, pair plots, and heatmaps.
3. Applications:
Detect anomalies.
Discover patterns.
Guide feature engineering.
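The summarization and asymmetry techniques map onto a few pandas calls, sketched below. The dataset and its column names (hours, marks) are hypothetical.

```python
import pandas as pd

# Hypothetical numeric dataset
data = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "marks": [35, 42, 50, 55, 63, 70, 74, 85],
})

# Data summarization: mean, median, and standard deviation per column
stats = data.agg(["mean", "median", "std"])

# Measuring asymmetry: skewness and kurtosis per column
skewness = data.skew()
kurtosis = data.kurt()

# Pairwise correlations; this is the matrix a heatmap would display
corr = data.corr()

print(stats)
print(skewness)
print(kurtosis)
print(corr)
```

A correlation close to 1 between two columns suggests they carry overlapping information, which can guide feature engineering.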