DSA Question Bank
Data Mining and Knowledge Discovery: Involves finding hidden patterns and
trends in large datasets, which can inform decisions in e-commerce, customer
relations, and more.
AI and ML: Develops models that perform tasks like language translation,
image recognition, and recommendation systems. These have practical
applications in autonomous vehicles, personalized marketing, and more.
Techniques:
Descriptive Visualizations: Bar charts, line charts, and pie charts provide
straightforward representations of data distributions or trends, making
them ideal for summarizing large datasets.
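As a minimal illustration (assuming Matplotlib and a small hypothetical dataset), all three chart types can be drawn as follows:

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures for illustration
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(months, sales)         # bar chart: compare categories
axes[1].plot(months, sales)        # line chart: show a trend over time
axes[2].pie(sales, labels=months)  # pie chart: show shares of a whole
plt.tight_layout()
plt.show()
```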
Exploratory Data Analysis (EDA) is a crucial step in understanding data before
applying any modeling techniques:
Data Collection and Loading: Gather data from various sources (databases,
APIs, or files) and load it into an appropriate environment for analysis.
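For example, a hypothetical CSV file can be loaded with Pandas and given a first look:

```python
import pandas as pd

# "sales.csv" is a hypothetical example path
df = pd.read_csv("sales.csv")

print(df.head())      # first rows of the dataset
df.info()             # column types and non-null counts
print(df.describe())  # summary statistics for numeric columns
```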
Purpose: Unlike EDA, which is used to explore data, Confirmatory Data Analysis (CDA) tests predefined hypotheses using statistical methods to confirm whether the initial assumptions hold true.
There are several ways that missing data can occur, each with its own
implications:
Imputation Techniques:
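A minimal sketch of two common numeric imputation approaches, assuming Pandas and scikit-learn and a hypothetical column with gaps:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, None, 31, None, 40]})  # hypothetical data

# Simple approach: fill missing values with the column mean in Pandas
df["age_filled"] = df["age"].fillna(df["age"].mean())

# scikit-learn's SimpleImputer does the same and fits into pipelines
imputer = SimpleImputer(strategy="mean")
df["age_imputed"] = imputer.fit_transform(df[["age"]]).ravel()
```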
In hypothesis testing, the null and alternative hypotheses serve complementary roles: the null hypothesis (H0) assumes there is no effect or difference, while the alternative hypothesis (H1) states the effect or difference the test is designed to detect.
1. Calculate the Mean, Mode, and Median for the given dataset
Given:
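Since the original values are not reproduced above, the sketch below uses a hypothetical dataset; Python's statistics module computes all three measures:

```python
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 5, 8]  # hypothetical dataset

print("Mean:  ", statistics.mean(data))    # sum / count = 56 / 9 ~ 6.22
print("Median:", statistics.median(data))  # middle value of sorted data -> 6
print("Mode:  ", statistics.mode(data))    # most frequent value -> 8
```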
Market segmentation involves dividing a broad consumer or business market,
typically consisting of existing and potential customers, into sub-groups of
consumers based on some type of shared characteristics. The steps involved in
market segmentation are:
1. Defining the Market: Identify and define the total market to be segmented,
including product or service offerings.
2. Identifying Segmentation Variables: Choose the variables on which the market will be segmented, such as demographic, geographic, psychographic, or behavioral characteristics.
3. Data Collection: Gather data from customers through surveys, focus groups, or secondary research.
4. Segmenting the Market: Analyze the data to divide the market into distinct segments based on the chosen variables (clustering is a common approach; see the sketch after this list).
5. Targeting: Evaluate the segments to identify the most attractive ones to target
based on size, growth potential, and fit with company objectives.
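Step 4 is often carried out with clustering. A minimal sketch, assuming scikit-learn and hypothetical customer features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age, annual income in $1000s]
customers = np.array([[25, 40], [32, 55], [58, 90],
                      [45, 70], [23, 35], [60, 95]])

# Scale features so age and income contribute comparably
scaled = StandardScaler().fit_transform(customers)

# Partition the market into 2 segments
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
print(kmeans.labels_)  # segment assignment for each customer
```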
5. Elaborate the different steps in Additive Seasonal Adjustment
Additive Seasonal Adjustment is a method used to remove seasonal fluctuations
from time series data to analyze underlying trends. The steps involved are:
1. Identify the Seasonal Component: The first step is to detect the seasonality in
the data (e.g., monthly, quarterly). This is done by examining the data over
multiple periods to identify consistent patterns or cycles.
2. Calculate the Seasonal Index: For each period in a year (or season), compute the seasonal index, which in the additive model is the average amount by which the value of the time series deviates from the overall level for that period.
3. Remove the Seasonal Component: Subtract the seasonal index from the observed values to adjust the data for seasonal effects (see the sketch after this list).
4. Analyze the Trend: Once the seasonality is removed, the remaining data can
be used to observe long-term trends, cyclic patterns, or irregular components.
5. Re-seasonalize (if necessary): After analyzing the adjusted data, the seasonal component can be added back when forecasting future periods.
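A minimal sketch of steps 1 to 3, assuming statsmodels and a hypothetical monthly series:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical 4 years of monthly data: trend + fixed seasonal swing + noise
idx = pd.date_range("2020-01", periods=48, freq="MS")
rng = np.random.default_rng(0)
series = pd.Series(np.linspace(100, 160, 48)
                   + 10 * np.sin(2 * np.pi * np.arange(48) / 12)
                   + rng.normal(0, 2, 48), index=idx)

# Additive decomposition estimates the seasonal component (steps 1-2)
result = seasonal_decompose(series, model="additive", period=12)

# Step 3: subtract the seasonal component to get the adjusted series
adjusted = series - result.seasonal
```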
1. Map Function: The map function takes input data and converts it into key-
value pairs, which are distributed across multiple nodes for parallel
processing.
2. Reduce Function: The reduce function takes the output from the map function
and aggregates or combines the data to produce final results.
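The idea can be sketched in plain Python with the classic word-count example (a real MapReduce job would run distributed across a cluster, e.g. on Hadoop):

```python
from collections import defaultdict
from functools import reduce

documents = ["big data is big", "data needs processing"]  # hypothetical input

# Map: emit (word, 1) key-value pairs from each input record
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group pairs by key (handled by the framework in real MapReduce)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: aggregate the values for each key into the final result
totals = {word: reduce(lambda a, b: a + b, counts)
          for word, counts in grouped.items()}
print(totals)  # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}
```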
Advantages:
2. Preprocessing: Provides utilities for feature scaling, encoding of categorical variables, and imputation of missing values.
3. Model Evaluation: The library includes several tools for model validation,
including cross-validation, metrics like accuracy, precision, recall, and
confusion matrices.
4. Ease of Use: It offers a simple, consistent API for fitting, predicting, and
evaluating models, making it easy to work with.
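A minimal sketch of points 3 and 4, using scikit-learn's bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Cross-validation: fit and score the model on 5 train/test splits
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```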
MAIN QUESTIONS
1. Discuss different phases in the lifecycle of Data Analysis.
The lifecycle of data analysis typically consists of multiple phases:
Data Collection: Gather relevant data from various sources, which may
include databases, web scraping, or IoT devices.
Result Communication: Present findings in a clear and actionable way, often
using visualizations and reports for stakeholders to make informed decisions.
Pricing Strategies: Retailers can use historical sales data and competitor
analysis to develop effective pricing strategies and optimize profitability.
Volume: Refers to the massive amount of data generated daily from various
sources like social media, IoT devices, and transactions.
4. Distinguish between Exploratory and Confirmatory Data Analysis.
Exploratory Data Analysis (EDA): Aimed at exploring data patterns, identifying
anomalies, and generating hypotheses. It often involves visualizations and
summary statistics and helps in understanding the dataset's structure without
a specific hypothesis.
Techniques Used: EDA relies on visualizations, summary statistics, and correlation analysis; CDA relies on statistical tests (t-tests, chi-square tests) and p-values.
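The contrast can be sketched on a hypothetical two-group dataset: EDA summarizes the data, while CDA runs a formal test (here SciPy's independent-samples t-test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(50, 5, 30)  # hypothetical measurements
group_b = rng.normal(53, 5, 30)

# EDA: summary statistics to explore the two groups
print(group_a.mean(), group_a.std(), group_b.mean(), group_b.std())

# CDA: the t-test confirms (or not) the hypothesis that the means differ
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)  # reject the null hypothesis if p_value < 0.05
```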
Relational Databases: Such as MySQL, PostgreSQL, and Oracle for structured
data storage.
Data Warehouses: Like Amazon Redshift, Google BigQuery, and Snowflake for
large-scale, structured data aggregation.
Data Lakes: Such as AWS S3 and Azure Data Lake, used to store structured,
semi-structured, and unstructured data.
Cloud Storage Platforms: AWS, Google Cloud, and Microsoft Azure offer
scalable storage solutions for large datasets.
Avoids Bias: Missing values can introduce bias if not addressed, as the
dataset may not represent the entire population.
Prevents Errors: Models and algorithms often cannot handle missing values,
leading to errors or inaccurate predictions.
Ordinal: Categorical data with a meaningful order, like survey responses (e.g.,
agree, neutral, disagree).
Interval: Numeric data with equal intervals but no true zero point, such as
temperature in Celsius.
Ratio: Numeric data with a true zero, allowing for meaningful comparisons and
ratios, like weight or height.
Binary: Data with only two values, such as yes/no or true/false, often used for
classifications.
Given:
Recency Bias: Giving more weight to recent events rather than considering
the entire data.
5. List down the different primary characteristics of market segments
Demographics: Age, gender, income level, and education.
Data Requirements: supervised text analysis requires labeled text data, while unsupervised text analysis works on unlabeled text data.
Examples: supervised includes spam detection and named entity recognition; unsupervised includes word embeddings and latent semantic analysis.
Indexing: Offers advanced slicing and indexing options for efficient data
manipulation.
Integration: Easily integrates with other Python libraries like Pandas and
Matplotlib.
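Assuming this passage refers to NumPy (given the integration with Pandas and Matplotlib), a minimal sketch of its slicing and indexing options:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)  # 3x4 array of 0..11

print(a[1, 2])       # single element: row 1, column 2 -> 6
print(a[:, 1])       # slicing: every row of column 1 -> [1 5 9]
print(a[a > 7])      # boolean indexing: elements greater than 7
print(a[[0, 2], :])  # fancy indexing: rows 0 and 2
```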
Data Storage: It breaks large data files into smaller blocks, which are stored
across distributed nodes.
Fault Tolerance: HDFS maintains data replication across nodes, ensuring data
availability even in case of hardware failures.
Accessibility: Enables distributed data access, supporting big data
processing frameworks like MapReduce.
1. Text Preprocessing
Tokenization: This step involves breaking down text into individual tokens
(words, phrases, or sentences). Tokenization is crucial because it prepares
raw text data for analysis by separating it into manageable components.
Removing Stop Words: Common words (like “and,” “the,” “is”) that do not
carry significant meaning are removed to reduce noise in the data.
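A minimal sketch of both steps, assuming NLTK (its tokenizer models and stop-word lists need a one-time download):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # one-time: tokenizer models
nltk.download("stopwords")  # one-time: stop-word lists

text = "The data is cleaned and the tokens are analyzed."
tokens = word_tokenize(text.lower())  # tokenization
stop_set = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_set]
print(filtered)  # ['data', 'cleaned', 'tokens', 'analyzed']
```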
2. Text Normalization
Stemming: Reducing words to their root forms (e.g., “running” to “run”) by
chopping off affixes. This helps in simplifying data and reduces vocabulary
size.
Lemmatization: Like stemming, lemmatization reduces words to their base
forms but considers context and part of speech to ensure grammatical
accuracy (e.g., “better” becomes “good”).
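Both normalizers side by side, again assuming NLTK (the WordNet dictionary needs a one-time download):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # one-time: lemmatizer dictionary

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                  # chops the affix -> 'run'
print(lemmatizer.lemmatize("better", pos="a"))  # uses part of speech -> 'good'
```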
3. Feature Extraction
Bag of Words (BoW): A technique that represents text data as a collection
of individual words, often used to quantify text by counting word
occurrences. It disregards the order but retains word frequency.
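A minimal sketch using scikit-learn's CountVectorizer on two hypothetical sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data drives decisions", "good data drives good models"]  # hypothetical
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # word counts per document, order ignored
```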
4. Text Encoding
One-Hot Encoding: Converts words or tokens into binary vectors, where
each word is represented by a unique position in the vector. Useful for
simpler models but inefficient with large vocabularies.
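A minimal sketch using pandas.get_dummies on a handful of hypothetical tokens; note how the vector width grows with the vocabulary:

```python
import pandas as pd

tokens = pd.Series(["cat", "dog", "cat", "bird"])  # hypothetical tokens

# Each unique word gets its own binary position in the vector
one_hot = pd.get_dummies(tokens, dtype=int)
print(one_hot)
#    bird  cat  dog
# 0     0    1    0
# 1     0    0    1
# ...
```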
6. Model Evaluation
Metrics: Evaluate model performance using metrics such as accuracy,
precision, recall, and F1 score. These metrics help assess how well the
model performs on classification or prediction tasks.
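A minimal sketch computing these metrics with scikit-learn on hypothetical true and predicted labels:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical true labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

print(accuracy_score(y_true, y_pred))    # fraction of correct predictions
print(precision_score(y_true, y_pred))   # of predicted positives, how many are right
print(recall_score(y_true, y_pred))      # of actual positives, how many are found
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))  # counts per (true, predicted) class
```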