
QUESTION BANK

1. Discuss different types of digital data


Digital data can be categorized into three main types:

Structured Data: This data is highly organized in a predefined format like
tables, making it easy to search, store, and process. It is often found in
relational databases and spreadsheets. Examples include customer records,
transactions, and financial data.

Unstructured Data: Unlike structured data, unstructured data has no specific
format. It includes a wide range of data types, such as text, images, videos,
and social media posts. This data is more difficult to analyze directly but holds
valuable insights when processed correctly.

Semi-Structured Data: Semi-structured data contains both structured and
unstructured elements. While it may have some organizational properties (like
tags or metadata), it doesn't fit neatly into a traditional relational model.
Examples include JSON, XML files, and NoSQL databases. This type of data is
often used in big data applications.

2. Elaborate scope of Data Science


The scope of data science is vast, as it has applications across multiple industries
and domains:

Predictive Analytics: It helps forecast future events by using historical data.
This is crucial in sectors like healthcare (predicting disease outbreaks),
finance (stock market predictions), and retail (predicting customer behavior).

Business Intelligence: Data science enables businesses to analyze data to
drive decision-making, uncover insights into operations, and improve strategy.

Data Mining and Knowledge Discovery: Involves finding hidden patterns and
trends in large datasets, which can inform decisions in e-commerce, customer
relations, and more.

AI and ML: Develops models that perform tasks like language translation,
image recognition, and recommendation systems. These have practical
applications in autonomous vehicles, personalized marketing, and more.

Data-Driven Decision Making: With a focus on using data to inform decisions,
data science helps organizations remain competitive, optimize processes, and
enhance customer experience.

3. Justify need of data analysis in Retail Industry


Data analysis is critical in the retail industry due to the highly competitive and fast-
changing nature of the market:

Customer Insights: By analyzing purchasing behavior, preferences, and
demographics, retailers can segment their customer base, providing
personalized services and promotions that increase engagement.

Inventory Management: Data analysis helps forecast demand, optimize stock
levels, and prevent overstock or stockouts, leading to reduced operational
costs and enhanced customer satisfaction.

Pricing Optimization: Dynamic pricing models based on customer data,
competitor prices, and market demand enable retailers to adjust pricing in
real-time for maximum profitability.

Sales and Marketing: Through analyzing trends, seasonality, and promotions,
data analysis can help devise effective marketing strategies, ensuring higher
returns on investment for advertising campaigns.

4. Discuss need of data visualization & classify data visualization techniques

Data visualization helps transform raw data into intuitive, visual formats, making it
easier to interpret and act upon:

Need: Visualization aids in recognizing patterns, trends, and correlations
quickly. It supports effective decision-making and is critical for understanding
complex datasets in a way that raw numbers cannot. Visualizations also
provide clarity when communicating results to stakeholders.

Techniques:

Descriptive Visualizations: Bar charts, line charts, and pie charts provide
straightforward representations of data distributions or trends, making
them ideal for summarizing large datasets.

Diagnostic Visualizations: Scatter plots and heatmaps help in identifying
relationships and patterns between variables, useful for diagnosing
problems or testing hypotheses.

Predictive Visualizations: These visualizations, like regression lines and
forecasting plots, help in understanding future trends and outcomes.

Interactive Visualizations: Tools like Power BI and Tableau allow users to
interact with data dynamically, creating customized views and deeper
insights, suitable for dashboards and exploratory analysis.
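For illustration, a minimal matplotlib sketch of a descriptive bar chart next to a diagnostic scatter plot; the monthly sales and ad-spend figures are invented purely for the example.

```python
# Minimal sketch: one descriptive and one diagnostic visualization with matplotlib.
# The sales and ad-spend numbers below are made up for illustration.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 142]      # hypothetical monthly sales
ad_spend = [10, 12, 15, 13]       # hypothetical ad spend for the same months

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Descriptive: bar chart summarizing sales per month
ax1.bar(months, sales, color="steelblue")
ax1.set_title("Monthly Sales (descriptive)")
ax1.set_ylabel("Units sold")

# Diagnostic: scatter plot to inspect the relationship between ad spend and sales
ax2.scatter(ad_spend, sales, color="darkorange")
ax2.set_title("Ad Spend vs Sales (diagnostic)")
ax2.set_xlabel("Ad spend")
ax2.set_ylabel("Units sold")

plt.tight_layout()
plt.show()
```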

5. Discuss different application areas of data profiling


Data profiling ensures that data is accurate, consistent, and of high quality. It is
used across various domains to improve data usability:

Database Management: Ensures that databases are free of errors and
inconsistencies, maintaining the integrity and reliability of the data stored for
future use.

Data Warehousing: Data profiling is used to assess the quality of incoming
data before integrating it into a data warehouse, ensuring that it adheres to
expected standards.

Compliance and Governance: It supports compliance with industry
regulations by ensuring that sensitive data is consistent, accurate, and
complete, minimizing legal and operational risks.

Market Research: Profiling customer data helps in identifying trends and
customer segments, allowing businesses to tailor products and marketing
strategies more effectively.

6. Elaborate different steps in EDA

Exploratory Data Analysis (EDA) is a crucial step in understanding data before
applying any modeling techniques:

Data Collection and Loading: Gather data from various sources (databases,
APIs, or files) and load it into an appropriate environment for analysis.

Data Cleaning: This step involves handling missing values, detecting
duplicates, and removing outliers, ensuring that the data is consistent and
ready for analysis.

Summary Statistics: Use statistical measures like mean, median, standard
deviation, and percentiles to get an overview of the data’s distribution and
identify any immediate patterns or anomalies.

Visualization: Create various plots like histograms, boxplots, and scatterplots
to visually explore the relationships and distributions of the data.

Hypothesis Formation: Develop initial hypotheses based on observations
from the data, which can be tested through further analysis or model building
to guide insights.
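A minimal pandas sketch of these steps; the file name data.csv and its columns are placeholders for whatever dataset is being explored.

```python
# Minimal EDA sketch with pandas; "data.csv" is a placeholder dataset.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")      # 1. load the data

print(df.isna().sum())            # 2. check for missing values
df = df.drop_duplicates()         #    and remove duplicate rows

print(df.describe())              # 3. summary statistics (mean, std, percentiles)

df.hist(figsize=(10, 6))          # 4. visualize distributions of numeric columns
plt.show()
```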

7. Comment on Confirmatory Data Analysis


Confirmatory Data Analysis (CDA) is designed to validate or reject hypotheses
formed during exploratory analysis:

Purpose: Unlike EDA, which is used to explore data, CDA tests predefined
hypotheses using statistical methods to confirm if initial assumptions hold
true.

Techniques: It involves formal statistical tests, such as t-tests, chi-square
tests, and regression models, to measure the significance of data findings.

Applications: Commonly used in academic research and decision-making
environments to test theories or validate claims with a high degree of
certainty.

Limitations: CDA requires careful hypothesis formulation and may overlook
hidden insights that weren't considered during the hypothesis stage.

8. List down different types of missing values

There are several ways that missing data can occur, each with its own
implications:

Missing Completely at Random (MCAR): The missing data is unrelated to both
the observed and unobserved data. It is often considered the least
problematic type.

Missing at Random (MAR): The missingness is related to other observed
variables but not the missing value itself, which may introduce some bias in
analysis.

Missing Not at Random (MNAR): The missingness is directly related to the
value that is missing, which can lead to more significant bias and challenges
when handling the data.

9. Elaborate different methods to handle missing data values


Handling missing data is essential for maintaining the integrity of a dataset:

Deletion Methods: Removing rows that contain missing values (listwise
deletion), or excluding incomplete records only from the specific calculations
that need them (pairwise deletion), is simple but may lead to a loss of
valuable information.

Imputation Techniques:

Mean/Median/Mode Imputation: For numerical data, missing values can
be replaced with the mean or median of the column; for categorical data,
the mode can be used.

Predictive Imputation: This method uses machine learning algorithms to
predict missing values based on other variables in the dataset.

KNN Imputation: The k-nearest neighbors algorithm can be used to
estimate missing values by considering similar data points.

Using Algorithms that Handle Missing Data: Some machine learning
algorithms, like decision trees, can naturally handle missing values without the
need for imputation.
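A short sketch of these strategies using pandas and scikit-learn imputers; the tiny DataFrame is invented for illustration.

```python
# Hedged sketch of deletion, mean imputation, and KNN imputation.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 35], "income": [50, 60, np.nan, 80]})

# Listwise deletion: drop any row that contains a missing value
deleted = df.dropna()

# Mean imputation per column
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# KNN imputation: estimate missing entries from the 2 most similar rows
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

print(mean_imputed)
print(knn_imputed)
```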

10. Differentiate between Null Hypothesis and Alternative Hypothesis

In hypothesis testing, the null and alternative hypotheses serve complementary
roles:

Null Hypothesis (H₀): It represents the default assumption that there is no
effect or relationship between variables. Researchers seek to either reject or
fail to reject the null hypothesis based on evidence. Example: "There is no
significant difference between the two treatments."

Alternative Hypothesis (H₁): This represents the hypothesis that there is a
significant effect or relationship between variables. It contradicts the null
hypothesis and is what researchers generally aim to support with evidence.
Example: "Treatment A is more effective than Treatment B."
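An illustrative two-sample t-test with SciPy for the treatment example above; the score lists are invented, and the one-sided `alternative` argument assumes SciPy 1.6 or newer.

```python
# Illustrative hypothesis test: H0 = equal means, H1 = treatment A has a higher mean.
from scipy import stats

treatment_a = [82, 85, 88, 90, 86]   # hypothetical scores
treatment_b = [78, 80, 84, 79, 81]   # hypothetical scores

# One-sided test (requires SciPy >= 1.6 for the `alternative` keyword)
t_stat, p_value = stats.ttest_ind(treatment_a, treatment_b, alternative="greater")

if p_value < 0.05:
    print("Reject H0: treatment A appears more effective")
else:
    print("Fail to reject H0: no significant difference detected")
```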

1. Calculate the Mean, Mode, and Median for the given dataset

2. Test at the 5% significance level whether there is sufficient evidence that the mean time has decreased

Given:

3. List down steps in a study of market segmentation

Market segmentation involves dividing a broad consumer or business market,
typically consisting of existing and potential customers, into sub-groups of
consumers based on some type of shared characteristics. The steps involved in
market segmentation are:

1. Defining the Market: Identify and define the total market to be segmented,
including product or service offerings.

2. Identifying Segmentation Variables: Select the relevant segmentation
variables, such as demographic, geographic, psychographic, or behavioral
factors.

3. Data Collection: Gather data from customers through surveys, focus groups,
or secondary research.

4. Segmenting the Market: Analyze the data to divide the market into distinct
segments based on the chosen variables.

5. Targeting: Evaluate the segments to identify the most attractive ones to target
based on size, growth potential, and fit with company objectives.

6. Positioning: Develop a positioning strategy that communicates how your
product or service meets the needs of the targeted segments.

7. Monitoring and Re-Evaluation: Continuously monitor market dynamics and
re-evaluate segmentation to ensure relevance.

4. List different error measures for evaluating forecast models


The accuracy of forecast models can be assessed using various error measures.
Commonly used error metrics include the Mean Absolute Error (MAE), Mean Squared
Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE).
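The standard definitions of these metrics, where y_t is the actual value, ŷ_t the forecast, and n the number of forecast points:

```latex
\begin{aligned}
\mathrm{MAE}  &= \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right| \\
\mathrm{MSE}  &= \frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2 \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2} \\
\mathrm{MAPE} &= \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|
\end{aligned}
```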

5. Elaborate different steps in Additive Seasonal Adjustment
Additive Seasonal Adjustment is a method used to remove seasonal fluctuations
from time series data to analyze underlying trends. The steps involved are:

1. Identify the Seasonal Component: The first step is to detect the seasonality in
the data (e.g., monthly, quarterly). This is done by examining the data over
multiple periods to identify consistent patterns or cycles.

2. Calculate the Seasonal Index: For each period in a year (or season), compute
the seasonal index, which in the additive model is the average amount by which
the value of the time series deviates from the overall level for that period.

3. Remove the Seasonal Component: Subtract the seasonal index from the
observed values to adjust the data for seasonal effects.

4. Analyze the Trend: Once the seasonality is removed, the remaining data can
be used to observe long-term trends, cyclic patterns, or irregular components.

5. Re-seasonalize (if necessary): After analyzing the adjusted data, you can
sometimes reintegrate the seasonal component back to the model if needed
for forecasting future periods.
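A hedged sketch of the procedure using statsmodels' seasonal_decompose on a synthetic monthly series; the data is generated for illustration only.

```python
# Additive seasonal decomposition sketch; the series below is synthetic.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: linear trend + yearly seasonal pattern + noise
index = pd.date_range("2020-01-01", periods=48, freq="MS")
values = (np.arange(48)
          + 10 * np.sin(2 * np.pi * np.arange(48) / 12)
          + np.random.normal(0, 1, 48))
series = pd.Series(values, index=index)

result = seasonal_decompose(series, model="additive", period=12)

# Seasonally adjusted series: subtract the estimated seasonal component
adjusted = series - result.seasonal
print(adjusted.head())
```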

6. Comment on MapReduce Component of Hadoop File System


MapReduce is a powerful data processing framework used in the Hadoop
ecosystem to process large-scale data across distributed computing resources. It
is composed of two main functions:

1. Map Function: The map function takes input data and converts it into key-
value pairs, which are distributed across multiple nodes for parallel
processing.

2. Reduce Function: The reduce function takes the output from the map function
and aggregates or combines the data to produce final results.

Advantages:

Scalability: Can handle large volumes of data by distributing processing tasks
across a cluster of machines.

Fault Tolerance: If a node fails, MapReduce automatically recovers, rerouting
tasks to healthy nodes.

Parallel Processing: Tasks are divided and executed in parallel, significantly
speeding up the computation process.
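A pure-Python sketch of the MapReduce programming model (a word count). This is not an actual Hadoop job; it only illustrates the map, shuffle, and reduce phases on made-up input documents.

```python
# Word count expressed in the MapReduce style: map -> shuffle -> reduce.
from collections import defaultdict

documents = ["big data needs big tools", "hadoop processes big data"]

# Map: emit (word, 1) key-value pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: aggregate the values for each key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 3, 'data': 2, ...}
```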

7. List features of Scikit-Learn library of Python


Scikit-learn is a popular machine learning library in Python, providing tools for
data mining and data analysis. Key features include:

1. Wide Range of Algorithms: It supports a variety of machine learning
algorithms such as regression, classification, clustering, and dimensionality
reduction.

2. Preprocessing Tools: Scikit-learn provides several utilities for data
preprocessing like scaling, normalization, encoding categorical variables, and
imputation of missing values.

3. Model Evaluation: The library includes several tools for model validation,
including cross-validation, metrics like accuracy, precision, recall, and
confusion matrices.

4. Ease of Use: It offers a simple, consistent API for fitting, predicting, and
evaluating models, making it easy to work with.

5. Integration with Other Libraries: Scikit-learn integrates seamlessly with other
Python libraries like NumPy, pandas, and Matplotlib, making it highly versatile
for data analysis workflows.

6. Efficiency: Scikit-learn is optimized for performance, with many of its
algorithms implemented in Cython or C.
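A minimal workflow showing several of these features together on scikit-learn's built-in iris dataset: preprocessing, model fitting, evaluation, and cross-validation.

```python
# Minimal scikit-learn pipeline: scale features, fit a classifier, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("5-fold CV scores:", cross_val_score(model, X, y, cv=5))
```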

MAIN QUESTIONS
1. Discuss different phases in the lifecycle of Data Analysis.
The lifecycle of data analysis typically consists of multiple phases:

Problem Formulation: Identify and clearly define the problem or business
question to guide the analysis.

Data Collection: Gather relevant data from various sources, which may
include databases, web scraping, or IoT devices.

Data Cleaning and Preprocessing: Handle missing values, remove outliers,
and ensure data consistency to prepare data for analysis.

Exploratory Data Analysis (EDA): Use summary statistics and visualization to
understand data patterns, distributions, and relationships.

Data Modeling: Apply analytical models, such as machine learning algorithms,
to extract insights or make predictions.

Result Communication: Present findings in a clear and actionable way, often
using visualizations and reports for stakeholders to make informed decisions.

2. Justify the need for data analysis in the Retail industry.


Data analysis is critical in the retail industry as it drives various aspects of
business performance:

Customer Insights: Helps retailers understand buying patterns, preferences,
and demographics, enabling personalized marketing.

Inventory Optimization: Analyzing sales trends assists in forecasting demand,
reducing overstock or stockouts, and enhancing inventory management.

Pricing Strategies: Retailers can use historical sales data and competitor
analysis to develop effective pricing strategies and optimize profitability.

Enhanced Customer Experience: Data-driven insights help in improving store
layouts, product placements, and customer service, creating a seamless
shopping experience.

3. Discuss the Characteristics of Big Data.


Big Data is characterized by the “Vs”, originally three and now commonly extended to five:

Volume: Refers to the massive amount of data generated daily from various
sources like social media, IoT devices, and transactions.

Velocity: Data is generated at a high speed and must often be processed in
real-time to extract timely insights.

Variety: Data comes in diverse formats, including structured data (databases),
semi-structured (JSON, XML), and unstructured data (texts, images).

Veracity: Ensures the accuracy and trustworthiness of data despite its
inconsistencies or incompleteness.

Value: Emphasizes the potential of Big Data to generate meaningful insights
that can drive business decisions.

4. Distinguish between Exploratory and Confirmatory Data
Analysis.
Exploratory Data Analysis (EDA): Aimed at exploring data patterns, identifying
anomalies, and generating hypotheses. It often involves visualizations and
summary statistics and helps in understanding the dataset's structure without
a specific hypothesis.

Confirmatory Data Analysis (CDA): Focuses on testing a specific hypothesis
or validating assumptions using statistical tests. It aims to confirm or reject
hypotheses and typically involves inferential statistics, such as p-values and
confidence intervals.

5. Elaborate different Data Profiling Functions.


Data profiling involves analyzing datasets to gather summary information about
data quality and structure:

Column Profiling: Examines individual columns to compute metrics like
minimum, maximum, mean, and unique value counts.

Dependency Profiling: Identifies relationships between columns, helping in
understanding dependencies that could impact analysis.

Redundancy Profiling: Checks for duplicate records or columns to help in
data cleaning.

Structure Profiling: Ensures data format consistency and examines patterns,
such as email or phone number formats, for accuracy.
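A simple pandas sketch of column, redundancy, and structure profiling; the customers.csv file and its email column are placeholders for a real dataset.

```python
# Hedged sketch of basic data profiling checks with pandas.
import pandas as pd

df = pd.read_csv("customers.csv")   # placeholder dataset

# Column profiling: basic statistics and unique-value counts per column
print(df.describe(include="all"))
print(df.nunique())

# Redundancy profiling: detect duplicate records
print("Duplicate rows:", df.duplicated().sum())

# Structure profiling: check that an assumed 'email' column matches a pattern
valid_email = df["email"].str.match(r"^[\w.+-]+@[\w-]+\.[\w.]+$", na=False)
print("Invalid email addresses:", (~valid_email).sum())
```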

| Aspect | Exploratory Data Analysis (EDA) | Confirmatory Data Analysis (CDA) |
| --- | --- | --- |
| Objective | To explore data patterns, summarize features, and identify trends | To test hypotheses and validate assumptions |
| Purpose | Initial data investigation and hypothesis generation | Hypothesis testing and statistical inference |
| Approach | Open-ended, flexible, and often unstructured | Structured and methodical, with a predefined hypothesis |
| Techniques Used | Visualizations, summary statistics, correlation analysis | Statistical tests (t-tests, chi-square tests), p-values |
| Outcomes | Insights, patterns, and potential questions | Conclusions about relationships or effects, hypothesis confirmation or rejection |
| Data Requirements | Requires raw or pre-processed data for initial insights | Requires cleaned, often structured data for reliable testing |
| Common Tools | Python (Pandas, Matplotlib, Seaborn), R, Tableau | Statistical software (SPSS, R), hypothesis testing libraries |
| Examples | Detecting data trends or anomalies, identifying correlations | Testing the effect of a drug in a medical study, A/B testing |
| Result Interpretation | Generates hypotheses to be further tested | Provides statistical evidence for or against a hypothesis |

6. Discuss features of Power BI.


Power BI offers several powerful features for data analysis and visualization:

Interactive Dashboards: Create dynamic dashboards that allow users to
interact with data in real time.

Data Transformation Tools: Includes data cleaning, transformation, and
integration tools for preparing data.

AI-Powered Insights: Provides built-in AI capabilities for automated insights
and natural language querying.

Multiple Data Source Integration: Connects to various data sources, such as
Excel, SQL databases, and cloud services, making it versatile for data
ingestion.

Custom Visualizations: Allows the creation of custom charts, maps, and
visuals to suit specific business needs.

7. List down different Data Repositories.


Common data repositories used in data science include:

Relational Databases: Such as MySQL, PostgreSQL, and Oracle for structured
data storage.

Data Warehouses: Like Amazon Redshift, Google BigQuery, and Snowflake for
large-scale, structured data aggregation.

Data Lakes: Such as AWS S3 and Azure Data Lake, used to store structured,
semi-structured, and unstructured data.

NoSQL Databases: Including MongoDB, Cassandra, and HBase, suited for
handling unstructured and semi-structured data.

Cloud Storage Platforms: AWS, Google Cloud, and Microsoft Azure offer
scalable storage solutions for large datasets.

8. Why is it necessary to handle missing values?


Handling missing values is essential to maintain data quality and accuracy:

Avoids Bias: Missing values can introduce bias if not addressed, as the
dataset may not represent the entire population.

Prevents Errors: Models and algorithms often cannot handle missing values,
leading to errors or inaccurate predictions.

Improves Consistency: Filling or removing missing values ensures the dataset
is consistent, enabling more reliable analysis.

Enhances Interpretability: Clean, complete data provides better insights and
clearer interpretation, supporting sound decision-making.

9. Elaborate different types of data attribute values.


Data attributes can be classified into several types based on their nature:

Nominal: Categorical data without a specific order, such as gender or country
names.

Ordinal: Categorical data with a meaningful order, like survey responses (e.g.,
agree, neutral, disagree).

Interval: Numeric data with equal intervals but no true zero point, such as
temperature in Celsius.

Ratio: Numeric data with a true zero, allowing for meaningful comparisons and
ratios, like weight or height.

Binary: Data with only two values, such as yes/no or true/false, often used for
classifications.

1. List down formulas for Location Measures of a Sample
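For a sample x₁, x₂, …, xₙ, the standard location measures are the mean, median, and mode; the definitions below are the textbook forms (x₍ᵢ₎ denotes the i-th value after sorting):

```latex
\begin{aligned}
\text{Mean: }   \bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i \\
\text{Median: } \tilde{x} &=
  \begin{cases}
    x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\[4pt]
    \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & n \text{ even}
  \end{cases} \\
\text{Mode: }   & \text{the value that occurs most frequently in the sample}
\end{aligned}
```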

2. Construct a probability distribution for the random variable X.


Given:

3. Hypothesis Testing for Assistant Professors' Salary

Given:

4. List down different human biases in forecasting


Anchoring Bias: Relying too heavily on the first piece of information when
making decisions.

Confirmation Bias: Searching for information that confirms existing beliefs.

Recency Bias: Giving more weight to recent events rather than considering
the entire data.

Overconfidence Bias: Overestimating one’s forecasting ability.

Availability Bias: Relying on information that is readily available or memorable
rather than comprehensive.

5. List Down different primary characteristics of segment
Demographics: Age, gender, income level, and education.

Geographic: Location-based characteristics like country, city, and climate.

Behavioral: Buying habits, brand loyalty, and usage rates.

Psychographic: Lifestyle, values, and interests.

Technographic: Technology usage, device preferences, and digital
engagement levels.

6. Identify and discuss situations in which A/B testing is useful


A/B testing is useful in situations where a direct comparison between two options
can improve decision-making:

Marketing Campaigns: Testing different email subject lines or ad creatives to
determine the most effective.

Website Design: Comparing layouts, buttons, or call-to-action (CTA)
placements to increase user engagement.

Product Feature Rollout: Testing new features or designs with a subset of
users to assess impact before full deployment.

Pricing Strategies: Evaluating different pricing options to understand their
effect on sales.

User Experience (UX): Comparing navigation flows or form designs to
enhance user satisfaction and conversion rates.
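An illustrative A/B comparison of two conversion rates using a two-proportion z-test from statsmodels; the visitor and conversion counts are invented for the example.

```python
# Hedged A/B test sketch: compare conversion rates of variants A and B.
from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 145]   # hypothetical conversions for variants A and B
visitors = [2400, 2500]    # hypothetical visitors exposed to each variant

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)

if p_value < 0.05:
    print("The difference in conversion rates is statistically significant")
else:
    print("No significant difference between variants A and B")
```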

7. Differentiate between Supervised and Unsupervised NLP


| Aspect | Supervised NLP | Unsupervised NLP |
| --- | --- | --- |
| Definition | Involves labeled data to train models | Involves unlabeled data, finding structure within |
| Data Requirements | Labeled text data | Unlabeled text data |
| Tasks | Sentiment analysis, text classification | Topic modeling, clustering |
| Examples | Spam detection, named entity recognition | Word embeddings, latent semantic analysis |
| Outcome | Predictive, with defined output labels | Descriptive, identifying patterns and groups |

8. List down features of Numpy Python Library


Array Support: Provides a powerful N-dimensional array object.

Mathematical Functions: Includes functions for complex mathematical
operations (e.g., linear algebra, Fourier transforms).

Broadcasting: Allows operations on arrays of different shapes without
additional memory.

Indexing: Offers advanced slicing and indexing options for efficient data
manipulation.

Performance: Optimized for performance, often faster than standard Python
lists for numeric data processing.

Integration: Easily integrates with other Python libraries like Pandas and
Matplotlib.
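A short NumPy sketch illustrating array creation, broadcasting, and a built-in linear algebra routine.

```python
# NumPy basics: N-dimensional arrays, broadcasting, vectorized math.
import numpy as np

matrix = np.arange(12).reshape(3, 4)            # 3x4 array of 0..11
row_means = matrix.mean(axis=1, keepdims=True)  # shape (3, 1)

# Broadcasting: the (3, 1) row means are stretched across the (3, 4) matrix
centered = matrix - row_means

print(centered)
print(np.linalg.norm(centered))                 # built-in linear algebra routine
```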

9. Justify significance of the following Components of Hadoop Ecosystem

1) HDFS (Hadoop Distributed File System)


Purpose: HDFS is designed to store vast amounts of data reliably across
multiple machines.

Data Storage: It breaks large data files into smaller blocks, which are stored
across distributed nodes.

Fault Tolerance: HDFS maintains data replication across nodes, ensuring data
availability even in case of hardware failures.

Scalability: Supports horizontal scaling, allowing the addition of storage as
data grows.

Accessibility: Enables distributed data access, supporting big data
processing frameworks like MapReduce.

2) YARN (Yet Another Resource Negotiator)


Resource Management: YARN manages and allocates computational
resources to various applications running on a Hadoop cluster.

Application Execution: It enables multiple applications (such as MapReduce)
to share resources within the same cluster.

Scalability: Improves cluster utilization, ensuring resources are used
efficiently.

Flexibility: Supports a variety of processing frameworks beyond MapReduce,
allowing different data processing methods to coexist in Hadoop.

10. Elaborate different steps in implementation of NLP

1. Text Preprocessing
Tokenization: This step involves breaking down text into individual tokens
(words, phrases, or sentences). Tokenization is crucial because it prepares
raw text data for analysis by separating it into manageable components.

Lowercasing: Converting all text to lowercase ensures uniformity, helping
to avoid discrepancies between words like “Apple” and “apple.”

Removing Stop Words: Common words (like “and,” “the,” “is”) that do not
carry significant meaning are removed to reduce noise in the data.

Punctuation Removal: Removing punctuation symbols to standardize the
text and improve the accuracy of text analysis.

2. Text Normalization
Stemming: Reducing words to their root forms (e.g., “running” to “run”) by
chopping off affixes. This helps in simplifying data and reduces vocabulary
size.

Lemmatization: Like stemming, lemmatization reduces words to their base
forms but considers context and part of speech to ensure grammatical
accuracy (e.g., “better” becomes “good”).

3. Feature Extraction
Bag of Words (BoW): A technique that represents text data as a collection
of individual words, often used to quantify text by counting word
occurrences. It disregards the order but retains word frequency.

TF-IDF (Term Frequency-Inverse Document Frequency): Weighs how often a
word appears in a document against how many documents in the collection
contain it, helping to identify terms that are distinctive and relevant.

Word Embeddings: Converting words into dense vectors (e.g., using
Word2Vec, GloVe) that capture semantic meaning. This enables the model
to understand relationships between words based on context.

4. Text Encoding
One-Hot Encoding: Converts words or tokens into binary vectors, where
each word is represented by a unique position in the vector. Useful for
simpler models but inefficient with large vocabularies.

Label Encoding: Assigns numerical values to categories, suitable for
smaller vocabularies or when order matters.

Target Encoding: Replaces categorical values with a numerical encoding
derived from a target variable, often useful in supervised NLP tasks.

5. Model Building and Training


Supervised NLP Models: For tasks like text classification or sentiment
analysis, models are trained on labeled data. Common algorithms include
Naive Bayes, Support Vector Machines, and neural networks.

Unsupervised NLP Models: For tasks like topic modeling or clustering,
models like Latent Dirichlet Allocation (LDA) are used to find hidden
structures in the data without labeled outputs.

6. Model Evaluation

Metrics: Evaluate model performance using metrics such as accuracy,
precision, recall, and F1 score. These metrics help assess how well the
model performs on classification or prediction tasks.

Cross-Validation: Splitting data into training and validation sets, or using k-
fold cross-validation, to ensure that the model generalizes well on unseen
data.

7. Deployment and Application


Integration: Once validated, the model is deployed into a production
environment where it can be integrated into applications like chatbots,
recommendation engines, or sentiment analysis systems.

Monitoring and Iteration: Continuously monitor the model’s performance
and update it as necessary, retraining with new data to improve its
accuracy and reliability.
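A compact scikit-learn sketch tying several of these steps together (tokenization and lowercasing via TfidfVectorizer, TF-IDF feature extraction, a supervised Naive Bayes model, and cross-validation); the tiny labeled corpus is invented for illustration.

```python
# End-to-end NLP sketch: preprocessing + TF-IDF + supervised model + evaluation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = [
    "I love this product, it works great",
    "Terrible quality, very disappointed",
    "Absolutely fantastic experience",
    "Worst purchase I have ever made",
]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative sentiment (toy example)

# TfidfVectorizer handles tokenization, lowercasing, and stop-word removal,
# then converts each text into a TF-IDF feature vector.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)

scores = cross_val_score(model, texts, labels, cv=2)
print("Cross-validation accuracy:", scores.mean())
```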
