EDA Question Bank Answers

The document provides a comprehensive overview of various concepts in exploratory data analysis (EDA), including types of data, data visualization techniques, data cleaning, and imputation methods. It outlines key differences between primary, secondary, and tertiary data, discusses the significance of graphical representations, and highlights the importance of data quality and preprocessing. Additionally, it covers challenges in data accessing, feature engineering, and dimensionality reduction techniques, emphasizing their roles in effective data analysis.

1. Key Differences Between Primary, Secondary, and Tertiary Data

- Question: What are the key differences between Primary, Secondary, and Tertiary data? Provide examples of each.
- Answer:
- Primary Data: Data collected directly by the
researcher for a specific purpose.
- Example: Conducting surveys or interviews to gather
customer feedback.
- Secondary Data: Data collected by someone else for
a different purpose but used by the researcher.
- Example: Using government census data for market
research.
- Tertiary Data: Summarized or analyzed data, often
derived from primary and secondary sources.
- Example: Reading a review article that summarizes
findings from multiple studies.
- Diagram: A flowchart showing the relationship
between primary, secondary, and tertiary data.
2. Bidimensional Graphical Representations

- Question: Explain Bidimensional Graphical Representations and their significance in data visualization.
- Answer:
- Definition: Graphical representations that display the
relationship between two variables.
- Examples:
- Scatter Plot: Shows the relationship between two
continuous variables.
- Heatmap: Uses color intensity to display values, such
as counts or correlations, across two dimensions.
- Significance: Helps identify patterns, trends, and
correlations between variables.
- Example: A scatter plot showing the relationship
between advertising spend and sales revenue.
- Graph: A scatter plot with a trend line.
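The trend line in a scatter plot like the one described is usually an ordinary least-squares fit. A minimal pure-Python sketch, with made-up advertising-spend and sales figures for illustration:

```python
# Least-squares trend line for a bidimensional (x, y) dataset -- the
# line a scatter plot's trend line would show. The numbers are
# illustrative, not real data.

def trend_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

ad_spend = [1, 2, 3, 4, 5]   # e.g. thousands of dollars
sales = [2, 4, 6, 8, 10]     # e.g. thousands of units
slope, intercept = trend_line(ad_spend, sales)
print(slope, intercept)  # 2.0 0.0 for this perfectly linear example
```

In practice a plotting library (e.g. matplotlib) would draw the scatter and overlay this fitted line.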

3. Distributional Assumptions in EDA

- Question: Discuss the Distributional Assumptions in Exploratory Data Analysis (EDA).
- Answer:
- Definition: Assumptions about the distribution of data,
such as normality, uniformity, or skewness.
- Importance: These assumptions guide the choice of
statistical tests and models.
- Example: In hypothesis testing, the assumption of
normality is crucial for tests like the t-test.
- Graph: A histogram showing a normal distribution.
4. Exploratory Data Analysis (EDA)

- Question: Explain Exploratory Data Analysis (EDA) in detail, highlighting its importance and techniques.
- Answer:
- Definition: EDA is the process of analyzing and
summarizing datasets to understand their main
characteristics, often using visual methods.
- Techniques:
- Summary Statistics: Mean, median, mode, standard
deviation.
- Visualizations: Histograms, box plots, scatter plots.
- Data Cleaning: Handling missing values, outliers,
and inconsistencies.
- Importance: EDA helps identify patterns, trends, and
anomalies in data.
- Example: A data scientist uses EDA to analyze
customer purchase behavior and identify key trends.
- Diagram: A flowchart showing the steps in EDA
(Data Collection → Data Cleaning → Visualization
→ Analysis).
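The summary-statistics step of EDA can be sketched with Python's standard `statistics` module; the sample values below are illustrative:

```python
# Summary statistics (mean, median, mode, standard deviation) with the
# standard-library statistics module; the data values are made up.
import statistics

data = [12, 15, 12, 18, 20, 22, 12, 25]

summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "stdev": statistics.stdev(data),  # sample standard deviation
}
print(summary)  # mean 17.0, median 16.5, mode 12, stdev ~5.04
```

Real projects typically get the same numbers in one call via `DataFrame.describe()` in pandas.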
5. Outliers and Their Types

- Question: Explain Outliers and their types. What techniques can be used to identify outliers in a dataset?
- Answer:
- Definition: Outliers are data points that deviate
significantly from other data points in a dataset.
- Types:
- Univariate Outliers: Outliers in a single variable (e.g.,
extremely high income).
- Multivariate Outliers: Outliers in multiple variables
(e.g., a person with high income but low spending).
- Techniques to Identify Outliers:
- Box Plot: Visualizes outliers using the interquartile
range (IQR).
- Z-Score: Identifies outliers based on standard
deviations from the mean.
- Example: In a dataset of house prices, a house priced
at $10 million while most houses are priced between
$100,000 and $500,000 is an outlier.
- Graph: A box plot showing outliers.
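Both detection techniques above can be written in a few lines of plain Python; the house-price figures are illustrative:

```python
# Two common outlier checks: IQR fences (the box-plot rule) and
# z-scores. Illustrative sketch using only the standard library.
import statistics

def iqr_outliers(data):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

def zscore_outliers(data, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the
    mean; this rule needs reasonably large samples to trigger."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return [x for x in data if abs(x - mu) / sigma > threshold]

prices = [100_000, 150_000, 200_000, 250_000, 300_000, 10_000_000]
print(iqr_outliers(prices))  # [10000000]
```

Note that on a sample this small the z-score rule cannot exceed 3, which is one reason the IQR rule is preferred for small datasets.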

6. Imputation Techniques for Missing Data

- Question: Describe the different types of imputation techniques used to handle missing data. Provide examples of scenarios where each technique would be appropriate.
- Answer:
- Definition: Imputation is the process of replacing
missing data with substituted values.
- Techniques:
- Mean/Median Imputation: Replace missing values
with the mean or median of the variable.
- Example: Replacing missing age values with the
median age.
- Regression Imputation: Predict missing values using
regression models.
- Example: Predicting missing income values based
on education level.
- K-Nearest Neighbors (KNN): Replace missing
values with the average of the nearest neighbors.
- Example: Replacing missing values in a dataset of
customer transactions.
- Diagram: A flowchart showing different imputation
techniques.
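Mean/median imputation, the simplest of the techniques above, can be sketched in plain Python (`None` marks a missing value; real projects would typically use `fillna` in pandas or `SimpleImputer` in scikit-learn):

```python
# Mean/median imputation: replace each missing entry (None) with the
# mean or median of the observed values. Illustrative sketch.
import statistics

def impute(values, strategy="median"):
    """Return a copy of `values` with None entries filled in."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

ages = [25, 30, None, 40, None, 35]
print(impute(ages))  # [25, 30, 32.5, 40, 32.5, 35]
```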

7. Nominal, Ordinal, Interval, and Ratio Data

- Question: Differentiate between Nominal, Ordinal, Interval, and Ratio data with examples.
- Answer:
- Nominal Data: Categories with no order (e.g., types of
fruits).
- Ordinal Data: Categories with order (e.g., education
levels: high school, bachelor’s, master’s).
- Interval Data: Numerical data with no true zero (e.g.,
temperature in Celsius).
- Ratio Data: Numerical data with a true zero (e.g.,
height, weight).
- Example: A survey collects nominal data (gender),
ordinal data (satisfaction level), interval data
(temperature), and ratio data (income).
- Diagram: A table comparing the four types of data.

8. Types of Data

- Question: Identify the type of data for each case:
- a. Quarterly GDP growth rates of a country over five
years.
- b. Employment status of individuals tracked over five
years.
- c. Types of vegetables sold in a market.
- d. Interviewing a scientist for their research findings.
- e. Reading a scientific review article on a topic.
- Answer:
- a. Interval Data: GDP growth rates are numerical and
can be negative, so there is no true absolute zero;
recorded quarterly, they also form a time series.
- b. Nominal Data: Employment status is categorical with
no order; tracked over five years, the records form a
longitudinal (panel) dataset.
- c. Nominal Data: Types of vegetables are categorical
with no order.
- d. Primary Data: Interviewing a scientist involves
collecting data directly.
- e. Tertiary Data: Reading a scientific review article
involves summarized data.

9. Steps in Data Discovery

- Question: Explain the steps involved in Data Discovery and how they help in data analysis.
- Answer:
- Steps:
1. Data Collection: Gather raw data from various
sources.
2. Data Cleaning: Handle missing values, outliers,
and inconsistencies.
3. Data Exploration: Use visualizations and summary
statistics to understand the data.
4. Data Analysis: Apply statistical techniques to
uncover patterns and insights.
- Example: A data scientist discovers patterns in
customer data by following these steps.
- Diagram: A flowchart showing the data discovery
process.

10. Unidimensional Graphical Representations

- Question: Explain Unidimensional Graphical Representations and their importance in data visualization.
- Answer:
- Definition: Graphical representations that display the
distribution of a single variable.
- Examples:
- Histogram: Shows the distribution of a continuous
variable.
- Bar Chart: Displays the frequency of categorical
variables.
- Importance: Helps in understanding the distribution
and central tendency of a single variable.
- Example: A histogram showing the distribution of
ages in a population.
- Graph: A histogram.

11. Data Quality Issues

- Question: Describe the common types of data quality issues encountered in raw datasets. Provide examples of how each issue can affect data analysis.
- Answer:
- Common Issues:
- Missing Data: Data points that are not recorded.
- Duplicate Data: Repeated entries in the dataset.
- Inconsistent Data: Data that does not follow a
consistent format.
- Example: A dataset with missing values can lead to
inaccurate analysis.
- Diagram: A flowchart showing data quality issues and
solutions.

12. Challenges in Data Accessing

- Question: Mention the challenges and issues related to Data Accessing in business analytics.
- Answer:
- Challenges:
- Data Privacy: Ensuring that sensitive data is
protected.
- Data Security: Preventing unauthorized access to
data.
- Data Accessibility: Ensuring that data is easily
accessible to authorized users.
- Example: A company faces challenges in accessing
customer data due to privacy regulations.

13. Data Preprocessing

- Question: Define Data Preprocessing and explain its role in improving data quality.
- Answer:
- Definition: The process of cleaning and transforming
raw data into a usable format.
- Role: Improves data quality and ensures that the data
is ready for analysis.
- Example: A dataset of customer transactions is
preprocessed by removing duplicates and handling
missing values.
- Diagram: A flowchart showing the data preprocessing
steps.
14. Types of Missing Data

- Question: What are the different types of missing data (MCAR, MAR, MNAR)? Explain with examples.
- Answer:
- MCAR (Missing Completely at Random): Missing data
is unrelated to any other variable.
- Example: A survey respondent accidentally skips a
question.
- MAR (Missing at Random): Missing data is related to
other observed variables.
- Example: Younger respondents are less likely to
report their income.
- MNAR (Missing Not at Random): Missing data is
related to the missing values themselves.
- Example: High-income individuals are less likely to
report their income.
- Diagram: A flowchart showing the types of missing
data.

15. Feature Engineering

- Question: Discuss the importance of feature engineering in data analysis.
- Answer:
- Definition: The process of creating new features or
transforming existing ones to improve model
performance.
- Importance: Enhances the predictive power of
machine learning models.
- Example: Creating a "day of the week" feature from a
timestamp to predict customer behavior.
- Diagram: A flowchart showing the feature engineering
process.
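The "day of the week" example above is a one-liner with the standard `datetime` module; the timestamps are illustrative:

```python
# Feature engineering: derive a "day of the week" feature from raw
# timestamp strings. The timestamps below are made up for illustration.
from datetime import datetime

timestamps = ["2024-01-01 09:30", "2024-01-06 14:00"]

def day_of_week(ts):
    """Parse a 'YYYY-MM-DD HH:MM' string and return the weekday name."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M").strftime("%A")

features = [day_of_week(ts) for ts in timestamps]
print(features)  # ['Monday', 'Saturday']
```

The new categorical feature can then be encoded (see the categorical-data section below) and fed to a model.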

16. Data Transformation Techniques

- Question: Explain data transformation techniques and their applications in data preprocessing.
- Answer:
- Definition: Techniques used to transform data into a
suitable format for analysis.
- Examples:
- Normalization: Scaling data to a range (e.g., 0 to 1).
- Standardization: Scaling data to have a mean of 0
and a standard deviation of 1.
- Example: Normalizing pixel values in an image
dataset for machine learning.
- Diagram: A flowchart showing data transformation
techniques.
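Both transformations can be sketched in a few lines of plain Python (scikit-learn's `MinMaxScaler` and `StandardScaler` are the usual production choices); the age values are illustrative:

```python
# Min-max normalization (scale to [0, 1]) vs z-score standardization
# (mean 0, standard deviation 1). Illustrative stdlib-only sketch.
import statistics

def normalize(data):
    """Min-max normalization: scale values to the [0, 1] range."""
    lo, hi = min(data), max(data)
    return [(x - lo) / (hi - lo) for x in data]

def standardize(data):
    """Z-score standardization: mean 0, sample standard deviation 1."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return [(x - mu) / sigma for x in data]

ages = [20, 30, 40, 50, 60]
print(normalize(ages))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Normalization preserves the shape of the distribution within a fixed range; standardization centers it, which many models (e.g. linear regression with regularization) expect.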
17. Data Normalization vs. Standardization

- Question: What is data normalization? How does it differ from data standardization?
- Answer:
- Normalization: Scales data to a range (e.g., 0 to 1).
- Example: Scaling customer age data to a range of 0
to 1.
- Standardization: Scales data to have a mean of 0 and
a standard deviation of 1.
- Example: Standardizing features in a dataset for
linear regression.
- Diagram: A comparison chart showing normalization
and standardization.

18. Correlation Analysis

- Question: Discuss the role of correlation analysis in data exploration.
- Answer:
- Definition: Measures the strength and direction of the
relationship between two variables.
- Role: Helps identify relationships between variables in
data exploration.
- Example: A correlation analysis between advertising
spend and sales revenue.
- Graph: A scatter plot showing the correlation between
two variables.
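The standard measure here is Pearson's correlation coefficient, computed directly from its definition below; the advertising-spend and sales numbers are made up for illustration:

```python
# Pearson's correlation coefficient r, computed from its definition:
# covariance divided by the product of the standard deviations.
import math

def pearson(x, y):
    """Return Pearson's r for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)

ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 10]
print(round(pearson(ad_spend, sales), 3))  # 0.843 -- strong positive
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate none (though a nonlinear relationship may still exist).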

19. Data Sampling Techniques

- Question: Explain the different types of data sampling techniques used in data analysis.
- Answer:
- Random Sampling: Every individual in the population
has an equal chance of being selected.
- Example: A lottery system where each ticket has an
equal chance of being drawn.
- Stratified Sampling: The population is divided into
strata (subgroups), and samples are taken from each
stratum.
- Example: A researcher divides a population into age
groups (e.g., 18-25, 26-35) and samples from each
group.
- Cluster Sampling: The population is divided into
clusters, and entire clusters are randomly selected for
analysis.
- Example: A company divides its customers by
region and randomly selects a few regions to survey.
- Diagram: A comparison chart showing random,
stratified, and cluster sampling.
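All three schemes can be sketched with the standard `random` module; the twenty customer records and region names below are made up for illustration:

```python
# Random, stratified, and cluster sampling sketched with the stdlib
# random module. The customer records are invented for illustration.
import random

population = [{"id": i, "region": r}
              for i, r in enumerate(["North", "South", "East", "West"] * 5)]

random.seed(0)  # fixed seed so the example is reproducible

# Random sampling: every record has an equal chance of selection.
simple = random.sample(population, k=5)

# Stratified sampling: sample separately within each region (stratum).
strata = {}
for rec in population:
    strata.setdefault(rec["region"], []).append(rec)
stratified = [rec for group in strata.values()
              for rec in random.sample(group, k=2)]

# Cluster sampling: pick whole regions at random, keep all their members.
chosen_regions = random.sample(sorted(strata), k=2)
cluster = [rec for r in chosen_regions for rec in strata[r]]

print(len(simple), len(stratified), len(cluster))  # 5 8 10
```

Stratified sampling guarantees every subgroup is represented; cluster sampling trades some representativeness for lower collection cost.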
20. Importance of Data Cleaning

- Question: Discuss the importance of Data Cleaning in business analytics.
- Answer:
- Definition: The process of detecting and correcting (or
removing) errors, inconsistencies, and inaccuracies in a
dataset.
- Importance:
- Improves Data Quality: Clean data ensures accurate
and reliable analysis.
- Enhances Decision-Making: Clean data leads to
better insights and decisions.
- Saves Time and Resources: Cleaning data upfront
reduces the need for rework during analysis.
- Example: A dataset of customer transactions is
cleaned by removing duplicate entries, handling missing
values, and correcting inconsistent formatting.
- Diagram: A flowchart showing the data cleaning
process.

21. Handling Duplicate Data

- Question: What are the various techniques to handle duplicate data in a dataset?
- Answer:
- Techniques:
- Removing Duplicates: Deleting repeated entries.
- Example: Removing duplicate customer records
from a database.
- Merging Duplicates: Combining duplicate records
into a single entry.
- Example: Merging duplicate customer records with
the same name and address.
- Importance: Handling duplicates ensures data
accuracy and consistency.
- Example: A retail company removes duplicate entries
from its sales dataset to ensure accurate revenue
calculations.
- Diagram: A flowchart showing the process of handling
duplicate data.
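Removing exact duplicates while keeping the first occurrence can be sketched in plain Python (pandas' `drop_duplicates` does the same job on DataFrames); the customer records are illustrative:

```python
# Remove later exact-duplicate records, preserving order of first
# appearance. Illustrative sketch with invented customer records.

def drop_duplicates(records):
    """Return records with exact duplicates removed, order preserved."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

customers = [
    {"name": "Ada", "city": "Pune"},
    {"name": "Raj", "city": "Delhi"},
    {"name": "Ada", "city": "Pune"},  # duplicate entry
]
print(drop_duplicates(customers))  # first two records only
```

Merging near-duplicates (same person, slightly different spelling) is harder and usually needs fuzzy matching rather than exact keys.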

22. Data Aggregation

- Question: Explain the concept of data aggregation and its significance in analytics.
- Answer:
- Definition: Data aggregation is the process of
summarizing data into a more usable format, such as
totals, averages, or counts.
- Significance:
- Simplifies Analysis: Aggregated data is easier to
analyze and interpret.
- Identifies Trends: Aggregation helps identify patterns
and trends in large datasets.
- Example: A company aggregates daily sales data into
monthly sales totals to analyze seasonal trends.
- Diagram: A bar chart showing monthly sales totals.
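The daily-to-monthly aggregation from the example is a simple group-and-sum; a stdlib sketch with invented sales figures (pandas `groupby` is the usual tool):

```python
# Aggregate daily sales into monthly totals with a plain dictionary.
# The sales figures are made up for illustration.
from collections import defaultdict

daily_sales = [
    ("2024-01-05", 100), ("2024-01-20", 150),
    ("2024-02-03", 200), ("2024-02-28", 50),
]

monthly_totals = defaultdict(int)
for date, amount in daily_sales:
    month = date[:7]  # "YYYY-MM" prefix of the ISO date
    monthly_totals[month] += amount

print(dict(monthly_totals))  # {'2024-01': 250, '2024-02': 250}
```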

23. Visualization Techniques for Univariate and Multivariate Data

- Question: What are the common visualization techniques used for univariate and multivariate data?
- Answer:
- Univariate Data: Data involving a single variable.
- Visualization Techniques:
- Histogram: Shows the distribution of a continuous
variable.
- Bar Chart: Displays the frequency of categorical
variables.
- Example: A histogram showing the distribution of
ages in a population.
- Multivariate Data: Data involving multiple variables.
- Visualization Techniques:
- Scatter Plot: Shows the relationship between two
continuous variables.
- Heatmap: Uses color intensity to display values, such
as counts or correlations, across two dimensions.
- Example: A scatter plot showing the relationship
between advertising spend and sales revenue.
- Graph: A scatter plot with a trend line.

24. Box Plot and Histogram

- Question: Discuss Box Plot and Histogram as graphical tools for data distribution analysis.
- Answer:
- Box Plot:
- Definition: A graphical representation of data using
quartiles to show the distribution and identify outliers.
- Use: Helps visualize the spread and skewness of
data.
- Example: A box plot showing the distribution of
house prices.
- Histogram:
- Definition: A graphical representation of the
frequency distribution of a continuous variable.
- Use: Helps understand the distribution and central
tendency of data.
- Example: A histogram showing the distribution of
employee salaries.
- Diagram: A side-by-side comparison of a box plot and
a histogram.
25. Dimensionality Reduction Techniques (PCA)

- Question: Explain the importance of dimensionality reduction techniques like PCA in data analysis.
- Answer:
- Definition: Dimensionality reduction techniques
reduce the number of features in a dataset while
preserving the most important information.
- Principal Component Analysis (PCA):
- Definition: A technique that transforms data into a
lower-dimensional space by identifying the directions
(principal components) that maximize variance.
- Use: Reduces the complexity of data while retaining
its structure.
- Example: Reducing the number of features in an
image dataset from 1000 to 50 for faster processing.
- Diagram: A graph showing the original data and the
reduced-dimensional data after PCA.
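For just two variables, PCA can be worked out by hand: build the 2x2 covariance matrix and take its principal eigenvector in closed form. A minimal sketch (real work would use `sklearn.decomposition.PCA` or `numpy.linalg.eigh`), assuming the two variables are actually correlated:

```python
# Minimal 2-variable PCA: the first principal component is the
# eigenvector of the covariance matrix with the largest eigenvalue.
# Assumes the covariance sxy is nonzero; illustrative data below.
import math

def pca_2d(xs, ys):
    """Return (largest eigenvalue, unit eigenvector) of the 2x2
    sample covariance matrix of xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] via trace/determinant.
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    lam = tr / 2 + math.sqrt((tr / 2) ** 2 - det)  # larger eigenvalue
    vx, vy = sxy, lam - sxx                        # eigenvector direction
    norm = math.hypot(vx, vy)
    return lam, (vx / norm, vy / norm)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]  # perfectly correlated with xs
lam, direction = pca_2d(xs, ys)
print(lam, direction)  # all variance lies along the (1, 2) direction
```

Because ys is exactly 2*xs here, the first component captures all the variance; projecting onto it reduces two dimensions to one with no information loss.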

26. Steps in Data Wrangling

- Question: Describe the steps involved in data wrangling and their importance.
- Answer:
- Definition: Data wrangling is the process of cleaning,
transforming, and integrating raw data into a usable
format for analysis.
- Steps:
1. Data Collection: Gather raw data from various
sources.
2. Data Cleaning: Handle missing values, outliers,
and inconsistencies.
3. Data Transformation: Normalize, standardize, or
aggregate data.
4. Data Integration: Combine data from multiple
sources.
- Importance: Ensures that the data is ready for
analysis.
- Example: A data scientist wrangles customer data by
cleaning, transforming, and integrating it into a single
dataset for analysis.
- Diagram: A flowchart showing the data wrangling
process.

27. Data Integrity and Security

- Question: Explain the role of data integrity and security in analytics.
- Answer:
- Definition: Data integrity refers to the accuracy and
consistency of data, while data security involves
protecting data from unauthorized access or breaches.
- Role:
- Data Integrity: Ensures that data is accurate and
reliable for analysis.
- Data Security: Protects sensitive data from breaches
and cyberattacks.
- Example: A company implements encryption and
access controls to protect customer data.
- Diagram: A flowchart showing data integrity and
security measures.

28. Impact of Biased Data

- Question: Discuss the impact of biased data in decision-making and machine learning models.
- Answer:
- Definition: Biased data is data that is not
representative of the population, leading to skewed
results and inaccurate conclusions.
- Impact:
- Inaccurate Decisions: Biased data can lead to poor
business decisions.
- Biased Machine Learning Models: Models trained on
biased data will produce biased predictions.
- Example: A biased dataset of job applicants leads to
discriminatory hiring practices.
- Diagram: A graph showing the impact of biased data
on model predictions.

29. Handling Categorical Data

- Question: What are the best practices for handling categorical data in business analytics?
- Answer:
- Techniques:
- One-Hot Encoding: Converts categorical variables
into binary vectors.
- Example: Converting "gender" into binary columns
(Male: 1 or 0, Female: 1 or 0).
- Label Encoding: Assigns a unique number to each
category.
- Example: Converting "product type" into numerical
values (e.g., Electronics: 1, Clothing: 2).
- Importance: Proper handling of categorical data
ensures accurate analysis and model performance.
- Example: A machine learning model uses one-hot
encoding to process categorical data like product
categories.
- Diagram: A flowchart showing the process of handling
categorical data.
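Both encodings can be sketched in plain Python (pandas' `get_dummies` and scikit-learn's `OneHotEncoder`/`LabelEncoder` are the usual production tools); the product categories are illustrative:

```python
# Label encoding (category -> integer) and one-hot encoding
# (category -> binary vector). Illustrative stdlib-only sketch.

def label_encode(values):
    """Map each category to a unique integer, in order of appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Turn each category into a binary indicator vector."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

products = ["Electronics", "Clothing", "Electronics"]
codes, mapping = label_encode(products)
print(codes)                     # [0, 1, 0]
print(one_hot_encode(products))  # [[0, 1], [1, 0], [0, 1]]
```

Label encoding implies an ordering, so it suits ordinal data; one-hot encoding avoids a spurious order but adds one column per category.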
30. Data Governance

- Question: Explain the role of data governance in ensuring high-quality data management.
- Answer:
- Definition: Data governance refers to the policies,
processes, and standards for managing data quality,
security, and accessibility.
- Role:
- Ensures Data Quality: Maintains accurate and
consistent data.
- Protects Data Security: Implements measures to
prevent unauthorized access.
- Promotes Data Accessibility: Ensures that data is
easily accessible to authorized users.
- Example: A company implements data governance
policies to ensure that customer data is accurate,
secure, and accessible.
- Diagram: A flowchart showing the components of
data governance.
