DM - Midsem - Question Bank

03610335-Data Mining

Unit-1 Fundamentals of Data Mining

Q: What is data mining?

A: Data mining is the process of discovering patterns, trends, and insights from large datasets
to extract useful information for decision-making and predictive analysis.

Q: What is the history of data mining?

A: Data mining has its roots in the 1960s and 1970s when statisticians began using computers
to analyze data. The term "data mining" gained popularity in the 1990s as computational
power increased and businesses began to recognize the value of extracting insights from their
data.

Q: What are some strategies and techniques used in data mining?

A: Strategies in data mining include association rule mining, classification, clustering,
regression analysis, and anomaly detection. Techniques such as decision trees, neural
networks, genetic algorithms, and support vector machines are commonly employed for these
purposes.
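
To make one of these techniques concrete, the following is a minimal classification sketch using a decision tree; it assumes scikit-learn is available and uses its bundled Iris dataset, neither of which is part of the question bank itself.

    # Minimal decision-tree classification sketch (assumes scikit-learn is installed).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Load a small labeled dataset and split it into training and test portions.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Fit a shallow decision tree and report its accuracy on the held-out data.
    clf = DecisionTreeClassifier(max_depth=3, random_state=42)
    clf.fit(X_train, y_train)
    print("Test accuracy:", clf.score(X_test, y_test))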

Q: What are some applications of data mining?

A: Data mining is applied in various fields including business and marketing (customer
segmentation, market basket analysis), healthcare (disease prediction, patient outcome
analysis), finance (fraud detection, risk management), and science (genome analysis,
environmental monitoring).

Q: What are the challenges of data mining?

A: Challenges in data mining include dealing with large volumes of data (big data), ensuring
data quality and consistency, addressing privacy concerns, handling noisy and incomplete
data, and selecting appropriate algorithms for specific tasks.

Q: What is the future of data mining?

A: The future of data mining is likely to involve advancements in machine learning
algorithms, deep learning, and artificial intelligence. Integration with other technologies such
as IoT and blockchain may further enhance its capabilities for real-time analysis and
decision-making.

Q: What are the issues in the Knowledge Discovery in Databases (KDD) process?

A: Issues in the KDD process include data preprocessing (cleaning, integration,
transformation), selecting suitable data mining techniques, interpreting and evaluating results,
and deploying models into operational systems.

Q: What are the types of data used in data mining?

A: Types of data used in data mining include structured data (relational databases),
semi-structured data (XML, JSON), unstructured data (text documents, images), spatial data
(geographical information), temporal data (time series), and multimedia data (audio, video).

Q: What is database data in the context of data mining?

A: Database data refers to structured data stored in relational databases, typically organized in
tables with predefined schemas. This type of data is commonly used in data mining for
analysis and modeling purposes.

Q: What are data warehouses and how are they relevant to data mining?

A: Data warehouses are centralized repositories that store integrated and structured data from
various sources for reporting and analysis. They are relevant to data mining as they provide a
unified view of data, which facilitates the discovery of patterns and trends across different
data sources.

Q: What is transactional data in the context of data mining?

A: Transactional data refers to records of individual transactions or events, such as purchases,
interactions, or behaviors. Analyzing transactional data can reveal patterns, associations, and
trends that are useful for business intelligence and decision-making.

Q: What are some other kinds of data that can be used in data mining?

A: Other kinds of data used in data mining include textual data (documents, emails), sensor
data (from IoT devices), social media data (tweets, posts), biological data (DNA sequences),
and streaming data (real-time data feeds). These diverse types of data provide valuable
insights when analyzed using appropriate techniques.

Unit-2 Objects, Attributes, & Statistical Description of Data

Q: What is a data attribute?

A: A data attribute, also known as a feature or variable, is a characteristic or property of an
object or phenomenon that can be measured or observed. In data mining and statistics,
attributes are used to describe and analyze data.

Q: What are nominal attributes?

A: Nominal attributes are categorical variables that represent qualitative data without any
inherent order or ranking. Examples include colors, types of animals, or categories of
products.

Q: What are binary attributes?

A: Binary attributes are nominal attributes with only two possible values, typically
represented as 0 and 1 or as "yes" and "no". Examples include gender (male/female),
presence/absence of a characteristic, or true/false responses.

Q: What are ordinal attributes?

A: Ordinal attributes are categorical variables that have a natural order or ranking, but the
intervals between values are not necessarily equal or meaningful. Examples include ratings
(e.g., 1 to 5 stars), education levels (e.g., high school, college, graduate), or socioeconomic
status (e.g., low, medium, high).

Q: What are numeric attributes?

A: Numeric attributes are variables that represent quantitative data and can take on numerical
values. They can be further classified into discrete and continuous attributes.

Q: What is the difference between discrete and continuous attributes?

A: Discrete attributes take on a finite or countable number of distinct values, while
continuous attributes can take on any value within a certain range. For example, the number
of children in a family is discrete, whereas temperature is continuous.

Q: What are mean, median, and mode?

A: Mean is the average value of a set of numbers, calculated by summing all values and
dividing by the total count. Median is the middle value when the data is arranged in
ascending or descending order (or the average of the two middle values when the count is
even). Mode is the value that appears most frequently in a dataset.
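
As a quick illustration, the following is a minimal sketch that computes all three measures with Python's standard statistics module; the sample values are invented for the example.

    # Computing mean, median, and mode with the standard library (illustrative data).
    import statistics

    data = [2, 3, 3, 5, 7, 10]

    print("Mean:", statistics.mean(data))      # (2+3+3+5+7+10) / 6 = 5
    print("Median:", statistics.median(data))  # average of the two middle values: (3+5)/2 = 4
    print("Mode:", statistics.mode(data))      # 3 appears most often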

Q: How do you measure the dispersion of data?

A: The dispersion of data refers to how spread out or clustered the values are around the
central tendency (mean, median, or mode). Common measures of dispersion include range,
quartiles, variance, and standard deviation.

Q: What is range as a measure of dispersion?

A: Range is the difference between the maximum and minimum values in a dataset. It
provides a simple measure of the spread of data but is sensitive to outliers.

Q: What are quartiles?

A: Quartiles divide a dataset into four equal parts, each containing 25% of the data. The first
quartile (Q1) is the value below which 25% of the data falls, the second quartile (Q2) is the
median, and the third quartile (Q3) is the value below which 75% of the data falls.
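
A short sketch covering both range and quartiles, assuming NumPy is available; the data is invented for illustration, and NumPy's default linear interpolation between data points is used for the percentiles.

    # Range and quartiles with NumPy (assumes numpy is installed; illustrative data).
    import numpy as np

    data = np.array([4, 7, 1, 9, 12, 5, 8, 3])

    # Range: difference between the maximum and minimum values.
    print("Range:", data.max() - data.min())  # 12 - 1 = 11

    # Quartiles: NumPy interpolates linearly between data points by default.
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3)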

Q: What is variance?

A: Variance measures the average squared deviation of each data point from the mean. It
provides a measure of how much the values in a dataset vary from the mean.

Q: What is standard deviation?

A: Standard deviation is the square root of the variance and provides a measure of the
dispersion of data around the mean. It is commonly used because it is expressed in the same
units as the original data, though, like the variance, it is sensitive to outliers.
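
To connect the two measures, here is a small sketch using Python's statistics module; the data is illustrative, and the comments note the population/sample distinction (dividing by n versus n-1).

    # Variance and standard deviation with the standard library (illustrative data).
    import statistics

    data = [2, 4, 4, 4, 5, 5, 7, 9]

    # Population variance: mean of squared deviations from the mean (mean here is 5).
    print("Population variance:", statistics.pvariance(data))  # 32 / 8 = 4
    print("Population std dev:", statistics.pstdev(data))      # sqrt(4) = 2

    # Sample versions divide by n-1 instead of n (Bessel's correction).
    print("Sample variance:", statistics.variance(data))       # 32 / 7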

Unit-3 Data Preprocessing


Q: What is data preprocessing?

A: Data preprocessing is the initial step in the data mining process that involves cleaning,
transforming, and organizing raw data into a format suitable for analysis. It aims to improve the
quality of data and prepare it for modeling.

Q: What are the major tasks in data preprocessing?


A: The major tasks in data preprocessing include data cleaning, data integration, data
transformation, and data reduction. These tasks address issues such as missing values, noise,
inconsistencies, and redundancy in the data.

Q: What is data cleaning and why is it important?


A: Data cleaning is the process of identifying and correcting errors, inconsistencies, and missing
values in a dataset. It is important because clean data ensures the accuracy and reliability of analysis
results.

Q: What are some common issues addressed in data cleaning?


A: Common issues in data cleaning include handling missing values, dealing with noisy data (outliers
and errors), and resolving inconsistencies such as duplicate records or conflicting information.

Q: How is missing data handled in data cleaning?


A: Missing data can be handled by imputation techniques such as mean imputation (replacing
missing values with the mean of the variable), mode imputation (replacing missing values with the
most frequent value), or using predictive models to estimate missing values based on other
variables.
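
A minimal sketch of mean and mode imputation, assuming pandas is available; the column names and values are hypothetical.

    # Mean and mode imputation with pandas (illustrative data).
    import pandas as pd

    df = pd.DataFrame({
        "age": [25, None, 30, 35, None],
        "city": ["Surat", "Surat", None, "Rajkot", "Surat"],
    })

    # Mean imputation for a numeric column.
    df["age"] = df["age"].fillna(df["age"].mean())

    # Mode imputation for a categorical column (mode() returns a Series; take the first value).
    df["city"] = df["city"].fillna(df["city"].mode()[0])

    print(df)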

Q: What is data integration and why is it important?


A: Data integration is the process of combining data from multiple sources into a unified view. It is
important for eliminating redundancy, resolving inconsistencies, and providing a comprehensive
dataset for analysis.

Q: What is the entity identification problem in data integration?


A: The entity identification problem arises when different datasets use different identifiers or
formats to represent the same entities. Resolving this problem involves identifying corresponding
entities across datasets and merging them into a single entity.
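
A hedged sketch of the entity identification problem: two hypothetical sources name the same customers differently, so the identifiers are normalized to a common form before merging (pandas assumed; all column names and records are made up).

    # Resolving differing identifiers before merging two sources.
    import pandas as pd

    sales = pd.DataFrame({"cust_id": ["C-001", "C-002"], "amount": [120, 80]})
    crm = pd.DataFrame({"customer": ["c001", "c002"], "segment": ["gold", "silver"]})

    # Normalize both identifiers to lowercase with no separators.
    sales["key"] = sales["cust_id"].str.lower().str.replace("-", "", regex=False)
    crm["key"] = crm["customer"].str.lower()

    # Merge the sources on the normalized key.
    merged = pd.merge(sales, crm, on="key")
    print(merged[["key", "amount", "segment"]])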

Q: How are redundancy and correlation analysis conducted in data integration?

A: Redundancy and correlation analysis involve identifying redundant attributes or tuples in the
integrated dataset. This can be done by analyzing correlations between variables and removing
redundant information to simplify the dataset.
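
One way to spot redundant attributes is a pairwise correlation matrix; this sketch assumes pandas is available, and the attributes and values are made up (the second column is just the first in different units).

    # Spotting redundant attributes via pairwise correlation (illustrative data).
    import pandas as pd

    df = pd.DataFrame({
        "height_cm": [150, 160, 170, 180, 190],
        "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # same information in inches
        "weight_kg": [55, 62, 70, 81, 95],
    })

    # A correlation near 1.0 between height_cm and height_in flags the redundancy,
    # so one of the two columns can be dropped.
    print(df.corr())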

Q: What is tuple duplication and how is it addressed in data integration?


A: Tuple duplication occurs when the same record appears multiple times in a dataset, either due to
errors or intentional duplication. It is addressed by identifying and removing duplicate tuples to
ensure data integrity and accuracy.
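
Duplicate removal is straightforward in practice; a minimal sketch with pandas (illustrative records):

    # Removing duplicate tuples with pandas.
    import pandas as pd

    df = pd.DataFrame({
        "order_id": [101, 102, 102, 103],
        "item": ["pen", "book", "book", "pen"],
    })

    # Keep the first occurrence of each fully identical row.
    deduped = df.drop_duplicates()
    print(deduped)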

Q: How are data value conflict detection and resolution performed in data integration?

A: Data value conflict detection involves identifying discrepancies or conflicts in data values across
different sources. Resolution methods may include using voting schemes, expert judgment, or
statistical methods to reconcile conflicting information and create a consistent dataset.
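
As one possible voting scheme, this sketch keeps, for each entity, the value reported by the majority of sources; pandas is assumed, and the entity names and values are hypothetical.

    # Majority-vote resolution of conflicting values (illustrative data).
    import pandas as pd

    reports = pd.DataFrame({
        "entity": ["E1", "E1", "E1", "E2", "E2"],
        "source": ["A", "B", "C", "A", "B"],
        "price": [10, 10, 12, 20, 20],
    })

    # groupby + mode picks the most frequently reported value per entity.
    resolved = reports.groupby("entity")["price"].agg(lambda s: s.mode().iloc[0])
    print(resolved)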
