The document provides a comprehensive overview of various concepts in exploratory data analysis (EDA), including types of data, data visualization techniques, data cleaning, and imputation methods. It outlines key differences between primary, secondary, and tertiary data, discusses the significance of graphical representations, and highlights the importance of data quality and preprocessing. Additionally, it covers challenges in data accessing, feature engineering, and dimensionality reduction techniques, emphasizing their roles in effective data analysis.
EDA Question Bank Answers
1. Key Differences Between Primary, Secondary, and Tertiary Data
- Question: What are the key differences between Primary, Secondary, and Tertiary data? Provide examples of each.
- Answer:
  - Primary Data: Data collected directly by the researcher for a specific purpose.
    - Example: Conducting surveys or interviews to gather customer feedback.
  - Secondary Data: Data collected by someone else for a different purpose but reused by the researcher.
    - Example: Using government census data for market research.
  - Tertiary Data: Summarized or analyzed data, often derived from primary and secondary sources.
    - Example: Reading a review article that summarizes findings from multiple studies.
  - Diagram: A flowchart showing the relationship between primary, secondary, and tertiary data.

2. Bidimensional Graphical Representations
- Question: Explain Bidimensional Graphical Representations and their significance in data visualization.
- Answer:
  - Definition: Graphical representations that display the relationship between two variables.
  - Examples:
    - Scatter Plot: Shows the relationship between two continuous variables.
    - Heatmap: Displays values across two categorical dimensions using color intensity.
  - Significance: Helps identify patterns, trends, and correlations between variables.
    - Example: A scatter plot showing the relationship between advertising spend and sales revenue.
  - Graph: A scatter plot with a trend line.
3. Distributional Assumptions in EDA
- Question: Discuss the Distributional Assumptions in Exploratory Data Analysis (EDA).
- Answer:
  - Definition: Assumptions about the distribution of the data, such as normality, uniformity, or skewness.
  - Importance: These assumptions guide the choice of statistical tests and models.
  - Example: In hypothesis testing, the assumption of normality is crucial for tests like the t-test.
  - Graph: A histogram showing a normal distribution.

4. Exploratory Data Analysis (EDA)
- Question: Explain Exploratory Data Analysis (EDA) in detail, highlighting its importance and techniques.
- Answer:
  - Definition: EDA is the process of analyzing and summarizing datasets to understand their main characteristics, often using visual methods.
  - Techniques:
    - Summary Statistics: Mean, median, mode, standard deviation.
    - Visualizations: Histograms, box plots, scatter plots.
    - Data Cleaning: Handling missing values, outliers, and inconsistencies.
  - Importance: EDA helps identify patterns, trends, and anomalies in data.
  - Example: A data scientist uses EDA to analyze customer purchase behavior and identify key trends.
  - Diagram: A flowchart showing the steps in EDA (Data Collection → Data Cleaning → Visualization → Analysis).

5. Outliers and Their Types
- Question: Explain Outliers and their types. What techniques can be used to identify outliers in a dataset?
- Answer:
  - Definition: Outliers are data points that deviate significantly from the other data points in a dataset.
  - Types:
    - Univariate Outliers: Outliers in a single variable (e.g., an extremely high income).
    - Multivariate Outliers: Outliers in a combination of variables (e.g., a person with high income but very low spending).
  - Techniques to Identify Outliers:
    - Box Plot: Visualizes outliers using the interquartile range (IQR).
    - Z-Score: Flags outliers based on their distance from the mean in standard deviations.
  - Example: In a dataset of house prices, a house priced at $10 million while most houses are priced between $100,000 and $500,000 is an outlier.
  - Graph: A box plot showing outliers.
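The two detection techniques above can be sketched in plain Python. Note that the 1.5×IQR fences and the |z| > 3 cutoff are the usual conventions, not values taken from this document:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs((v - mean) / sd) > threshold]

# The house-price example from the answer: the $10M house is flagged by the IQR rule.
prices = [120_000, 150_000, 200_000, 250_000, 300_000, 450_000, 10_000_000]
print(iqr_outliers(prices))  # → [10000000]
```

One caveat worth knowing: a single extreme value inflates the mean and standard deviation, so the z-score rule can miss the very outlier that distorts it; the IQR rule is more robust in that case.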
6. Imputation Techniques for Missing Data
- Question: Describe the different types of imputation techniques used to handle missing data. Provide examples of scenarios where each technique would be appropriate.
- Answer:
  - Definition: Imputation is the process of replacing missing data with substituted values.
  - Techniques:
    - Mean/Median Imputation: Replace missing values with the mean or median of the variable.
      - Example: Replacing missing age values with the median age.
    - Regression Imputation: Predict missing values using regression models.
      - Example: Predicting missing income values from education level.
    - K-Nearest Neighbors (KNN): Replace missing values with the average of the nearest neighbors.
      - Example: Filling missing values in a dataset of customer transactions.
  - Diagram: A flowchart showing different imputation techniques.
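The simplest of the techniques above, median imputation, can be sketched in a few lines of plain Python (libraries such as scikit-learn's SimpleImputer offer production-ready versions of mean, median, and KNN imputation):

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values.

    A minimal sketch of median imputation on a single variable.
    """
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# The "missing age" example from the answer; 34.5 is the median of the observed ages.
ages = [23, None, 31, 45, None, 38]
print(impute_median(ages))  # → [23, 34.5, 31, 45, 34.5, 38]
```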
7. Nominal, Ordinal, Interval, and Ratio Data
- Question: Differentiate between Nominal, Ordinal, Interval, and Ratio data with examples.
- Answer:
  - Nominal Data: Categories with no inherent order (e.g., types of fruits).
  - Ordinal Data: Categories with a meaningful order (e.g., education levels: high school, bachelor's, master's).
  - Interval Data: Numerical data with no true zero (e.g., temperature in Celsius).
  - Ratio Data: Numerical data with a true zero (e.g., height, weight).
  - Example: A survey collects nominal data (gender), ordinal data (satisfaction level), interval data (temperature), and ratio data (income).
  - Diagram: A table comparing the four types of data.
8. Types of Data
- Question: Identify the type of data for each case:
  - a. Quarterly GDP growth rates of a country over five years.
  - b. Employment status of individuals tracked over five years.
  - c. Types of vegetables sold in a market.
  - d. Interviewing a scientist for their research findings.
  - e. Reading a scientific review article on a topic.
- Answer:
  - a. Interval Data: Quarterly GDP growth rates are numerical and can be negative, so there is no true zero.
  - b. Nominal Data: Employment status is categorical with no order.
  - c. Nominal Data: Types of vegetables are categorical with no order.
  - d. Primary Data: Interviewing a scientist collects data directly from the source.
  - e. Tertiary Data: A scientific review article summarizes findings from other studies.
9. Steps in Data Discovery
- Question: Explain the steps involved in Data Discovery and how they help in data analysis.
- Answer:
  - Steps:
    1. Data Collection: Gather raw data from various sources.
    2. Data Cleaning: Handle missing values, outliers, and inconsistencies.
    3. Data Exploration: Use visualizations and summary statistics to understand the data.
    4. Data Analysis: Apply statistical techniques to uncover patterns and insights.
  - Example: A data scientist discovers patterns in customer data by following these steps.
  - Diagram: A flowchart showing the data discovery process.
10. Unidimensional Graphical Representations
- Question: Explain Unidimensional Graphical Representations and their importance in data visualization.
- Answer:
  - Definition: Graphical representations that display the distribution of a single variable.
  - Examples:
    - Histogram: Shows the distribution of a continuous variable.
    - Bar Chart: Displays the frequency of each category of a categorical variable.
  - Importance: Helps in understanding the distribution and central tendency of a single variable.
  - Example: A histogram showing the distribution of ages in a population.
  - Graph: A histogram.
11. Data Quality Issues
- Question: Describe the common types of data quality issues encountered in raw datasets. Provide examples of how each issue can affect data analysis.
- Answer:
  - Common Issues:
    - Missing Data: Data points that were never recorded; gaps can bias averages and model estimates.
    - Duplicate Data: Repeated entries in the dataset; duplicates inflate counts and totals.
    - Inconsistent Data: Data that does not follow a consistent format (e.g., "NY" vs. "New York"); inconsistencies break grouping and joins.
  - Example: A dataset with many missing values can lead to inaccurate analysis if the gaps are ignored.
  - Diagram: A flowchart showing data quality issues and solutions.
12. Challenges in Data Accessing
- Question: Mention the challenges and issues related to Data Accessing in business analytics.
- Answer:
  - Challenges:
    - Data Privacy: Ensuring that sensitive data is protected.
    - Data Security: Preventing unauthorized access to data.
    - Data Accessibility: Ensuring that data is easily available to authorized users.
  - Example: A company faces challenges in accessing customer data due to privacy regulations.
13. Data Preprocessing
- Question: Define Data Preprocessing and explain its role in improving data quality.
- Answer:
  - Definition: The process of cleaning and transforming raw data into a usable format.
  - Role: Improves data quality and ensures that the data is ready for analysis.
  - Example: A dataset of customer transactions is preprocessed by removing duplicates and handling missing values.
  - Diagram: A flowchart showing the data preprocessing steps.

14. Types of Missing Data
- Question: What are the different types of missing data (MCAR, MAR, MNAR)? Explain with examples.
- Answer:
  - MCAR (Missing Completely at Random): The probability of a value being missing is unrelated to any variable, observed or unobserved.
    - Example: A survey respondent accidentally skips a question.
  - MAR (Missing at Random): The probability of missingness depends on other observed variables.
    - Example: Younger respondents are less likely to report their income.
  - MNAR (Missing Not at Random): The probability of missingness depends on the missing values themselves.
    - Example: High-income individuals are less likely to report their income.
  - Diagram: A flowchart showing the types of missing data.
15. Feature Engineering
- Question: Discuss the importance of feature engineering in data analysis.
- Answer:
  - Definition: The process of creating new features or transforming existing ones to improve model performance.
  - Importance: Enhances the predictive power of machine learning models.
  - Example: Creating a "day of the week" feature from a timestamp to predict customer behavior.
  - Diagram: A flowchart showing the feature engineering process.
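The "day of the week" example above can be sketched with the standard library; the field names (`timestamp`, `day_of_week`) are illustrative, not from the document:

```python
from datetime import datetime

def add_day_of_week(records):
    """Derive a 'day_of_week' feature from each record's ISO timestamp."""
    for r in records:
        ts = datetime.fromisoformat(r["timestamp"])
        r["day_of_week"] = ts.strftime("%A")  # e.g. "Saturday"
    return records

orders = [{"timestamp": "2024-01-06T14:30:00", "amount": 42.0}]
print(add_day_of_week(orders)[0]["day_of_week"])  # 2024-01-06 was a Saturday
```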
16. Data Transformation Techniques
- Question: Explain data transformation techniques and their applications in data preprocessing.
- Answer:
  - Definition: Techniques used to transform data into a format suitable for analysis.
  - Examples:
    - Normalization: Scaling data to a fixed range (e.g., 0 to 1).
    - Standardization: Scaling data to have a mean of 0 and a standard deviation of 1.
  - Example: Normalizing pixel values in an image dataset for machine learning.
  - Diagram: A flowchart showing data transformation techniques.

17. Data Normalization vs. Standardization
- Question: What is data normalization? How does it differ from data standardization?
- Answer:
  - Normalization: Rescales data to a fixed range, typically 0 to 1, using the minimum and maximum values.
    - Example: Scaling customer age data to a range of 0 to 1.
  - Standardization: Rescales data to have a mean of 0 and a standard deviation of 1, using the mean and standard deviation.
    - Example: Standardizing features in a dataset for linear regression.
  - Diagram: A comparison chart showing normalization and standardization.
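The difference between the two rescalings is easiest to see side by side; a minimal sketch in plain Python (scikit-learn's MinMaxScaler and StandardScaler are the usual library equivalents):

```python
import statistics

def normalize(values):
    """Min-max normalization to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: (x - mean) / stdev."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]

ages = [18, 25, 40, 60]
print(normalize(ages))    # smallest age maps to 0.0, largest to 1.0
print(standardize(ages))  # result has mean 0 and unit standard deviation
```

Normalization preserves the shape of the data within a fixed range but is sensitive to extreme min/max values; standardization centers the data and is the common choice when a model assumes roughly zero-mean features.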
18. Correlation Analysis
- Question: Discuss the role of correlation analysis in data exploration.
- Answer:
  - Definition: Measures the strength and direction of the linear relationship between two variables, typically on a scale from -1 to +1.
  - Role: Helps identify relationships between variables in data exploration.
  - Example: A correlation analysis between advertising spend and sales revenue.
  - Graph: A scatter plot showing the correlation between two variables.
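The most common measure, Pearson's correlation coefficient, can be computed directly from its definition; the advertising/revenue numbers below are made up for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

ad_spend = [10, 20, 30, 40, 50]
revenue = [105, 210, 290, 420, 495]
print(round(pearson_r(ad_spend, revenue), 3))  # close to 1: strong positive correlation
```

Values near +1 indicate a strong positive linear relationship, near -1 a strong negative one, and near 0 little linear relationship (a nonlinear relationship may still exist).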
19. Data Sampling Techniques
- Question: Explain the different types of data sampling techniques used in data analysis.
- Answer:
  - Random Sampling: Every individual in the population has an equal chance of being selected.
    - Example: A lottery system where each ticket has an equal chance of being drawn.
  - Stratified Sampling: The population is divided into strata (subgroups), and samples are taken from each stratum.
    - Example: A researcher divides a population into age groups (e.g., 18-25, 26-35) and samples from each group.
  - Cluster Sampling: The population is divided into clusters, and entire clusters are randomly selected for analysis.
    - Example: A company divides its customers by region and randomly selects a few regions to survey.
  - Diagram: A comparison chart showing random, stratified, and cluster sampling.

20. Importance of Data Cleaning
- Question: Discuss the importance of Data Cleaning in business analytics.
- Answer:
  - Definition: The process of detecting and correcting (or removing) errors, inconsistencies, and inaccuracies in a dataset.
  - Importance:
    - Improves Data Quality: Clean data ensures accurate and reliable analysis.
    - Enhances Decision-Making: Clean data leads to better insights and decisions.
    - Saves Time and Resources: Cleaning data upfront reduces rework during analysis.
  - Example: A dataset of customer transactions is cleaned by removing duplicate entries, handling missing values, and correcting inconsistent formatting.
  - Diagram: A flowchart showing the data cleaning process.
21. Handling Duplicate Data
- Question: What are the various techniques to handle duplicate data in a dataset?
- Answer:
  - Techniques:
    - Removing Duplicates: Deleting repeated entries.
      - Example: Removing duplicate customer records from a database.
    - Merging Duplicates: Combining duplicate records into a single entry.
      - Example: Merging duplicate customer records that share the same name and address.
  - Importance: Handling duplicates ensures data accuracy and consistency.
  - Example: A retail company removes duplicate entries from its sales dataset to ensure accurate revenue calculations.
  - Diagram: A flowchart showing the process of handling duplicate data.
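The "remove duplicates" technique above can be sketched as keep-first deduplication on a chosen key; field names are illustrative, and pandas' DataFrame.drop_duplicates covers the same idea for tabular data:

```python
def drop_duplicates(records, key_fields):
    """Keep only the first record for each unique combination of key_fields."""
    seen = set()
    unique = []
    for r in records:
        key = tuple(r[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

customers = [
    {"name": "Ada", "address": "1 Main St", "spend": 100},
    {"name": "Ada", "address": "1 Main St", "spend": 100},  # duplicate entry
    {"name": "Bob", "address": "2 Elm St", "spend": 50},
]
print(len(drop_duplicates(customers, ["name", "address"])))  # → 2
```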
22. Data Aggregation
- Question: Explain the concept of data aggregation and its significance in analytics.
- Answer:
  - Definition: Data aggregation is the process of summarizing data into a more usable form, such as totals, averages, or counts.
  - Significance:
    - Simplifies Analysis: Aggregated data is easier to analyze and interpret.
    - Identifies Trends: Aggregation helps reveal patterns and trends in large datasets.
  - Example: A company aggregates daily sales data into monthly sales totals to analyze seasonal trends.
  - Diagram: A bar chart showing monthly sales totals.
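The daily-to-monthly example above can be sketched with a grouped sum; dates are assumed to be ISO "YYYY-MM-DD" strings, so the first seven characters give the month key:

```python
from collections import defaultdict

def monthly_totals(daily_sales):
    """Aggregate (date, amount) pairs into per-month totals."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        totals[date[:7]] += amount  # "2024-01-05" → "2024-01"
    return dict(totals)

sales = [("2024-01-05", 100.0), ("2024-01-20", 250.0), ("2024-02-03", 80.0)]
print(monthly_totals(sales))  # → {'2024-01': 350.0, '2024-02': 80.0}
```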
23. Visualization Techniques for Univariate and Multivariate Data
- Question: What are the common visualization techniques used for univariate and multivariate data?
- Answer:
  - Univariate Data: Data involving a single variable.
    - Visualization Techniques:
      - Histogram: Shows the distribution of a continuous variable.
      - Bar Chart: Displays the frequency of categorical variables.
    - Example: A histogram showing the distribution of ages in a population.
  - Multivariate Data: Data involving multiple variables.
    - Visualization Techniques:
      - Scatter Plot: Shows the relationship between two continuous variables.
      - Heatmap: Displays values across two dimensions using color intensity.
    - Example: A scatter plot showing the relationship between advertising spend and sales revenue.
  - Graph: A scatter plot with a trend line.
24. Box Plot and Histogram
- Question: Discuss Box Plot and Histogram as graphical tools for data distribution analysis.
- Answer:
  - Box Plot:
    - Definition: A graphical representation of data using quartiles to show the distribution and identify outliers.
    - Use: Helps visualize the spread and skewness of data.
    - Example: A box plot showing the distribution of house prices.
  - Histogram:
    - Definition: A graphical representation of the frequency distribution of a continuous variable.
    - Use: Helps understand the distribution and central tendency of data.
    - Example: A histogram showing the distribution of employee salaries.
  - Diagram: A side-by-side comparison of a box plot and a histogram.

25. Dimensionality Reduction Techniques (PCA)
- Question: Explain the importance of dimensionality reduction techniques like PCA in data analysis.
- Answer:
  - Definition: Dimensionality reduction techniques reduce the number of features in a dataset while preserving the most important information.
  - Principal Component Analysis (PCA):
    - Definition: A technique that projects data onto a lower-dimensional space along the directions (principal components) that maximize variance.
    - Use: Reduces the complexity of data while retaining its structure.
    - Example: Reducing the number of features in an image dataset from 1000 to 50 for faster processing.
  - Diagram: A graph showing the original data and the reduced-dimensional data after PCA.
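A minimal PCA sketch via the eigendecomposition of the covariance matrix, assuming NumPy is available (scikit-learn's PCA is the usual production route, and the 100×5 data below is random, purely for illustration):

```python
import numpy as np

def pca_reduce(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k variance directions
    return Xc @ top                         # coordinates in the reduced basis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_reduced = pca_reduce(X, 2)
print(X_reduced.shape)  # → (100, 2)
```

By construction the first returned component captures the most variance, the second the next most, which is why dropping trailing components loses the least information.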
26. Steps in Data Wrangling
- Question: Describe the steps involved in data wrangling and their importance.
- Answer:
  - Definition: Data wrangling is the process of cleaning, transforming, and integrating raw data into a usable format for analysis.
  - Steps:
    1. Data Collection: Gather raw data from various sources.
    2. Data Cleaning: Handle missing values, outliers, and inconsistencies.
    3. Data Transformation: Normalize, standardize, or aggregate data.
    4. Data Integration: Combine data from multiple sources.
  - Importance: Ensures that the data is ready for analysis.
  - Example: A data scientist wrangles customer data by cleaning, transforming, and integrating it into a single dataset for analysis.
  - Diagram: A flowchart showing the data wrangling process.
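The four wrangling steps above can be sketched as one small pipeline over lists of dicts; the function and field names (`wrangle`, `id`, `value`) are illustrative, not from the document:

```python
def wrangle(raw_sources):
    """Collect, clean, transform, and integrate lists of {id, value} records."""
    # 1. Collection: flatten records gathered from several sources.
    records = [r for source in raw_sources for r in source]
    # 2. Cleaning: drop records with a missing value.
    records = [r for r in records if r.get("value") is not None]
    # 3. Transformation: min-max normalize the value field.
    vals = [r["value"] for r in records]
    lo, hi = min(vals), max(vals)
    for r in records:
        r["value"] = (r["value"] - lo) / (hi - lo)
    # 4. Integration: index by id so sources can be joined on a key.
    return {r["id"]: r for r in records}

a = [{"id": 1, "value": 10}, {"id": 2, "value": None}]
b = [{"id": 3, "value": 30}]
print(sorted(wrangle([a, b])))  # → [1, 3]  (record 2 dropped in cleaning)
```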
27. Data Integrity and Security
- Question: Explain the role of data integrity and security in analytics.
- Answer:
  - Definition: Data integrity refers to the accuracy and consistency of data, while data security involves protecting data from unauthorized access or breaches.
  - Role:
    - Data Integrity: Ensures that data is accurate and reliable for analysis.
    - Data Security: Protects sensitive data from breaches and cyberattacks.
  - Example: A company implements encryption and access controls to protect customer data.
  - Diagram: A flowchart showing data integrity and security measures.
28. Impact of Biased Data
- Question: Discuss the impact of biased data in decision-making and machine learning models.
- Answer:
  - Definition: Biased data is data that is not representative of the population, leading to skewed results and inaccurate conclusions.
  - Impact:
    - Inaccurate Decisions: Biased data can lead to poor business decisions.
    - Biased Machine Learning Models: Models trained on biased data produce biased predictions.
  - Example: A biased dataset of job applicants leads to discriminatory hiring practices.
  - Diagram: A graph showing the impact of biased data on model predictions.
29. Handling Categorical Data
- Question: What are the best practices for handling categorical data in business analytics?
- Answer:
  - Techniques:
    - One-Hot Encoding: Converts each category of a variable into its own binary column.
      - Example: Converting "gender" into binary columns (Male: 1 or 0, Female: 1 or 0).
    - Label Encoding: Assigns a unique integer to each category.
      - Example: Converting "product type" into numerical values (e.g., Electronics: 1, Clothing: 2).
  - Importance: Proper handling of categorical data ensures accurate analysis and model performance.
  - Example: A machine learning model uses one-hot encoding to process categorical data like product categories.
  - Diagram: A flowchart showing the process of handling categorical data.

30. Data Governance
- Question: Explain the role of data governance in ensuring high-quality data management.
- Answer:
  - Definition: Data governance refers to the policies, processes, and standards for managing data quality, security, and accessibility.
  - Role:
    - Ensures Data Quality: Maintains accurate and consistent data.
    - Protects Data Security: Implements measures to prevent unauthorized access.
    - Promotes Data Accessibility: Ensures that data is easily available to authorized users.
  - Example: A company implements data governance policies to ensure that customer data is accurate, secure, and accessible.
  - Diagram: A flowchart showing the components of data governance.