DM - Midsem - Question Bank
Q: What is data mining?
A: Data mining is the process of discovering patterns, trends, and insights from large datasets
to extract useful information for decision-making and predictive analysis.
Q: What is the history of data mining?
A: Data mining has its roots in the 1960s and 1970s, when statisticians began using computers
to analyze data. The term "data mining" gained popularity in the 1990s as computational
power increased and businesses began to recognize the value of extracting insights from their
data.
Q: What are some applications of data mining?
A: Data mining is applied in various fields including business and marketing (customer
segmentation, market basket analysis), healthcare (disease prediction, patient outcome
analysis), finance (fraud detection, risk management), and science (genome analysis,
environmental monitoring).
Q: What are the challenges in data mining?
A: Challenges in data mining include dealing with large volumes of data (big data), ensuring
data quality and consistency, addressing privacy concerns, handling noisy and incomplete
data, and selecting appropriate algorithms for specific tasks.
Q: What is database data and how is it used in data mining?
A: Database data refers to structured data stored in relational databases, typically organized in
tables with predefined schemas. This type of data is commonly used in data mining for
analysis and modeling purposes.
Q: What are data warehouses and how are they relevant to data mining?
A: Data warehouses are centralized repositories that store integrated and structured data from
various sources for reporting and analysis. They are relevant to data mining as they provide a
unified view of data, which facilitates the discovery of patterns and trends across different
data sources.
Q: What are some other kinds of data that can be used in data mining?
A: Other kinds of data used in data mining include textual data (documents, emails), sensor
data (from IoT devices), social media data (tweets, posts), biological data (DNA sequences),
and streaming data (real-time data feeds). These diverse types of data provide valuable
insights when analyzed using appropriate techniques.
Q: What are nominal attributes?
A: Nominal attributes are categorical variables that represent qualitative data without any
inherent order or ranking. Examples include colors, types of animals, or categories of
products.
Q: What are binary attributes?
A: Binary attributes are nominal attributes with only two possible values, typically
represented as 0 and 1 or as "yes" and "no". Examples include gender (male/female),
presence/absence of a characteristic, or true/false responses.
Q: What are ordinal attributes?
A: Ordinal attributes are categorical variables whose values have a natural order or ranking,
but the intervals between successive values are unknown or unequal. Examples include ratings
(e.g., 1 to 5 stars), education levels (e.g., high school, college, graduate), or socioeconomic
status (e.g., low, medium, high).
Q: What are numeric attributes?
A: Numeric attributes are variables that represent quantitative data and can take on numerical
values. They can be further classified into discrete and continuous attributes.
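For illustration, here is a minimal Python sketch of a single data record that mixes the four
attribute types described above; the field names and values are hypothetical:

    # Hypothetical customer record illustrating the attribute types discussed above.
    customer = {
        "product_category": "electronics",  # nominal: categories with no inherent order
        "is_subscriber": 1,                 # binary: only two possible values (0/1)
        "satisfaction_rating": 4,           # ordinal: ordered 1..5, intervals not equal
        "num_purchases": 12,                # numeric (discrete): countable integer
        "total_spend": 349.99,              # numeric (continuous): measured amount
    }

    for field, value in customer.items():
        print(f"{field}: {value}")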
Q: What are mean, median, and mode?
A: Mean is the average value of a set of numbers, calculated by summing all values and
dividing by the total count. Median is the middle value when the data is arranged in
ascending or descending order. Mode is the value that appears most frequently in a dataset.
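A quick way to see these three measures side by side is Python's standard statistics module;
the sample values below are arbitrary:

    import statistics

    # Arbitrary sample data for illustration
    values = [2, 3, 3, 5, 7, 8, 3, 10]

    print("mean:", statistics.mean(values))      # sum / count = 41 / 8 = 5.125
    print("median:", statistics.median(values))  # middle of the sorted values -> 4.0
    print("mode:", statistics.mode(values))      # most frequent value -> 3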
Q: What is meant by the dispersion of data?
A: The dispersion of data refers to how spread out or clustered the values are around the
central tendency (mean, median, or mode). Common measures of dispersion include range,
quartiles, variance, and standard deviation.
Q: What is range?
A: Range is the difference between the maximum and minimum values in a dataset. It
provides a simple measure of the spread of data but is sensitive to outliers.
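For example, the range of a small sample can be computed directly, and a single extreme value
shows how sensitive it is to outliers (the numbers are arbitrary):

    values = [4, 8, 15, 16, 23, 42]
    print("range:", max(values) - min(values))  # 42 - 4 = 38

    # One outlier inflates the range dramatically
    values_with_outlier = values + [500]
    print("range with outlier:", max(values_with_outlier) - min(values_with_outlier))  # 496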
Q: What are quartiles?
A: Quartiles divide a dataset into four equal parts, each containing 25% of the data. The first
quartile (Q1) is the value below which 25% of the data falls, the second quartile (Q2) is the
median, and the third quartile (Q3) is the value below which 75% of the data falls.
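A minimal sketch using the standard library's statistics.quantiles (the sample data is
arbitrary; method="inclusive" interpolates over the data itself):

    import statistics

    values = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]  # already sorted, n = 11

    # n=4 splits the data into quartiles
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    print("Q1:", q1)           # 25% of the data falls below this value
    print("Q2 (median):", q2)  # same as the median of the data
    print("Q3:", q3)           # 75% of the data falls below this value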
Q: What is variance?
A: Variance measures the average squared deviation of each data point from the mean. It
provides a measure of how much the values in a dataset vary from the mean.
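The same module gives variance and standard deviation; pvariance treats the values as the
full population (average squared deviation from the mean), while variance applies the sample
(n - 1) correction:

    import statistics

    values = [2, 4, 4, 4, 5, 5, 7, 9]

    print("mean:", statistics.mean(values))                        # 5.0
    print("population variance:", statistics.pvariance(values))    # 32 / 8 = 4.0
    print("population std dev:", statistics.pstdev(values))        # sqrt(4.0) = 2.0
    print("sample variance:", statistics.variance(values))         # 32 / 7, uses n - 1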
Q: How is data value conflict detection and resolution performed in data integration?
A: Data value conflict detection involves identifying discrepancies or conflicts in data values across
different sources. Resolution methods may include using voting schemes, expert judgment, or
statistical methods to reconcile conflicting information and create a consistent dataset.
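As a minimal sketch of the voting approach, assume the conflicting values for one attribute
have already been collected from several sources; the source names and values below are
hypothetical:

    from collections import Counter

    # Hypothetical conflicting values for the same customer's "city" attribute,
    # gathered from three source systems during integration.
    reported_values = {
        "crm_system": "Mumbai",
        "billing_system": "Mumbai",
        "legacy_export": "Bombay",
    }

    def resolve_by_vote(values):
        """Pick the value reported by the most sources (simple majority vote)."""
        counts = Counter(values.values())
        value, votes = counts.most_common(1)[0]
        return value, votes

    resolved, votes = resolve_by_vote(reported_values)
    print(f"resolved value: {resolved} ({votes} of {len(reported_values)} sources agree)")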