Unit I Data Mining
Basis of Comparison | Data Warehousing | Data Mining
Definition | A data warehouse is a database system designed for analytical work instead of transactional work. | Data mining is the process of analyzing data patterns.
Purpose | Data warehousing is the process of extracting and storing data to allow easier reporting. | Data mining is the use of pattern recognition logic to identify patterns.
Task | Data warehousing is the process of extracting and storing data in order to make reporting more efficient. | Pattern recognition logic is used in data mining to find patterns.
KDD Process
KDD (Knowledge Discovery in Databases) is the process of
extracting useful, previously unknown, and potentially valuable
information from large datasets. The process is iterative: the
steps below usually have to be repeated several times before
accurate knowledge can be extracted from the data. The KDD
process includes the following steps:
Data Cleaning
Data cleaning is the removal of noisy and irrelevant data from
the collection. It covers:
1. Handling missing values.
2. Smoothing noisy data, where noise is a random error or
variance in a measured variable.
3. Cleaning with data discrepancy detection and data
transformation tools.
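The three cleaning tasks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (age, city), the median fill, and the valid age range of 0 to 120 are all invented for the example and are not prescribed by the KDD process.

```python
from statistics import median

# Hypothetical raw records: None marks a missing value, 430 is a noisy entry,
# and "new york" / "New York " are discrepant spellings of the same city.
raw = [
    {"age": 34, "city": "New York "},
    {"age": None, "city": "Boston"},
    {"age": 430, "city": "new york"},
    {"age": 29, "city": "boston"},
]

def clean(records, low=0, high=120):
    """Return cleaned copies of the records (rules are assumptions)."""
    # Ages outside [low, high] are treated as noise, like missing values.
    valid = [r["age"] for r in records
             if r["age"] is not None and low <= r["age"] <= high]
    fill = median(valid)  # replacement for missing or noisy ages
    cleaned = []
    for r in records:
        age = r["age"]
        if age is None or not (low <= age <= high):
            age = fill
        # Resolve spelling discrepancies with a simple normalization.
        city = r["city"].strip().title()
        cleaned.append({"age": age, "city": city})
    return cleaned

cleaned = clean(raw)
```

Filling with the median is one common choice; dropping the affected records or filling with a domain-specific default are equally valid depending on the data.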
Data Integration
Data integration combines heterogeneous data from multiple
sources into a common store (a data warehouse). It is carried
out with data migration tools, data synchronization tools, and
the ETL (Extract-Transform-Load) process.
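As a rough sketch of the extract, transform, and load steps, the example below merges two sources that describe the same customers with different schemas. The source names (crm_rows, billing_rows), the field names, and the in-memory dictionary standing in for the warehouse are hypothetical.

```python
# Two hypothetical sources with different key names, key types, and units.
crm_rows = [
    {"customer_id": 1, "full_name": "Ada Lovelace"},
    {"customer_id": 2, "full_name": "Alan Turing"},
]
billing_rows = [
    {"cust": "1", "total_cents": 12550},
    {"cust": "2", "total_cents": 990},
    {"cust": "1", "total_cents": 450},
]

def extract():
    """Extract: read rows from each source (here, in-memory lists)."""
    return crm_rows, billing_rows

def transform(crm, billing):
    """Transform: unify keys and aggregate billing per customer."""
    unified = {r["customer_id"]: {"name": r["full_name"], "total_cents": 0}
               for r in crm}
    for r in billing:
        key = int(r["cust"])  # string id -> integer id
        unified[key]["total_cents"] += r["total_cents"]
    return unified

def load(unified, warehouse):
    """Load: write the integrated rows into the common store."""
    warehouse.update(unified)
    return warehouse

warehouse = load(transform(*extract()), {})
```

After the run, each customer appears once in warehouse, keyed by a single integer id, with billing totals aggregated across the source rows.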
Data Selection
Data selection is the process in which data relevant to the
analysis is identified and retrieved from the data collection.
Methods such as neural networks, decision trees, Naive Bayes,
clustering, and regression can be used for this.
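In its simplest form, selection is a filter on rows plus a projection onto the attributes the analysis needs. The sketch below shows only that basic case, not the model-based methods listed above; the records, the attribute names, and the adult-only predicate are assumptions made for illustration.

```python
# Hypothetical integrated records; only some attributes matter for the task.
records = [
    {"id": 1, "age": 34, "income": 52000, "favorite_color": "blue", "churned": 0},
    {"id": 2, "age": 17, "income": 0, "favorite_color": "red", "churned": 0},
    {"id": 3, "age": 45, "income": 61000, "favorite_color": "green", "churned": 1},
]

def select(rows, attributes, predicate):
    """Keep rows that satisfy the predicate, projected onto the attributes."""
    return [{a: row[a] for a in attributes} for row in rows if predicate(row)]

# Task-relevant data: adults only, described by age, income, and the target.
task_data = select(records, ["age", "income", "churned"],
                   lambda r: r["age"] >= 18)
```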
Data Transformation
Data transformation is the process of converting data into the
form required by the mining procedure. It is a two-step
process:
1. Data mapping: assigning elements from the source base to
the destination to capture transformations.
2. Code generation: creating the actual transformation
program.
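The two steps can be illustrated with a mapping table and a function built from it. This is a sketch, not a real code generator: the field names, the conversions, and the helper build_transformer are hypothetical, and a closure stands in for the generated transformation program.

```python
# Step 1, data mapping: destination field -> (source field, conversion).
MAPPING = {
    "customer_id": ("CustID", int),
    "signup_year": ("SignupDate", lambda s: int(s[:4])),
    "income_k": ("Income", lambda v: float(v) / 1000),
}

def build_transformer(mapping):
    """Step 2, 'code generation': turn the mapping into a callable."""
    def transform(row):
        return {dest: convert(row[src])
                for dest, (src, convert) in mapping.items()}
    return transform

transform = build_transformer(MAPPING)
row = {"CustID": "17", "SignupDate": "2021-03-15", "Income": "52000"}
result = transform(row)
```

Keeping the mapping as data means a new destination field only needs a new entry in MAPPING; the transformation function itself does not change.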
Data Mining
Data mining is the application of techniques that extract
potentially useful patterns. It transforms task-relevant data
into patterns and decides the purpose of the model
using classification or characterization.
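As one concrete instance of extracting patterns, the sketch below counts item pairs that occur together in at least a given fraction of transactions. Frequent-pair counting is chosen here only as an example technique; the transactions and the 0.6 support threshold are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"eggs", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for pairs meeting the minimum support."""
    counts = Counter()
    for t in transactions:
        # sorted() gives each pair one canonical ordering.
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

patterns = frequent_pairs(transactions, min_support=0.6)
```

With this data, only the pairs (bread, butter) and (bread, milk) appear in three of the five transactions and are kept.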
Pattern Evaluation
Pattern evaluation identifies the truly interesting patterns
that represent knowledge, based on given interestingness
measures. It computes an interestingness score for each pattern
and uses summarization and visualization to make the data
understandable to the user.
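Support, confidence, and lift are commonly used interestingness measures for association rules, and the sketch below scores two candidate rules with them. The transactions and the choice of these three measures are assumptions for illustration; the text does not prescribe specific measures.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"eggs", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]

def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for antecedent -> consequent."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n_a == 0 or n_c == 0:
        return 0.0, 0.0, 0.0  # rule cannot be scored on this data
    support = n_both / n
    confidence = n_both / n_a
    lift = confidence / (n_c / n)  # > 1 means a positive association
    return support, confidence, lift

butter_rule = rule_measures(transactions, {"butter"}, {"bread"})
milk_rule = rule_measures(transactions, {"milk"}, {"bread"})
```

Here the rule butter -> bread has lift above 1, while milk -> bread has lift below 1, so only the first would be reported as interesting under a lift-based cutoff.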
Knowledge Representation
This involves presenting the results in a way that is meaningful
and can be used to make decisions.
Note: KDD is an iterative process in which evaluation measures
can be enhanced, mining can be refined, and new data can be
integrated and transformed in order to get different and more
appropriate results. Preprocessing of databases consists of
data cleaning and data integration.
Advantages of KDD
1. Improves decision-making: KDD provides valuable insights
and knowledge that can help organizations make better
decisions.
2. Increased efficiency: KDD automates repetitive and time-
consuming tasks and makes the data ready for analysis,
which saves time and money.
3. Better customer service: KDD helps organizations gain a
better understanding of their customers’ needs and
preferences, which can help them provide better customer
service.
4. Fraud detection: KDD can be used to detect fraudulent
activities by identifying patterns and anomalies in the data
that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive
models that can forecast future trends and patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it
involves collecting and analyzing large amounts of data,
which can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires
specialized skills and knowledge to implement and
interpret the results.
3. Unintended consequences: KDD can lead to unintended
consequences, such as bias or discrimination, if the data or
models are not properly understood or used.
4. Data quality: The KDD process depends heavily on the quality
of the data. If the data is not accurate or consistent, the
results can be misleading.
5. High cost: KDD can be an expensive process, requiring
significant investments in hardware, software, and
personnel.
6. Overfitting: The KDD process can lead to overfitting, a
common problem in machine learning in which a model
learns the detail and noise in the training data to the
extent that its performance on new, unseen data
suffers.
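Overfitting can be demonstrated on a toy problem. In the sketch below, a 1-nearest-neighbor model memorizes two deliberately mislabeled training points, while a single-threshold rule cannot. The data, the underlying rule (label 1 when x >= 5), and both models are invented for illustration.

```python
# Hypothetical 1-D data. The underlying rule is "label 1 when x >= 5";
# the training points at x = 3 and x = 7 carry flipped (noisy) labels.
train = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 0), (4.5, 0),
         (5.5, 1), (6, 1), (7, 0), (8, 1), (9, 1), (10, 1)]
test = [(1.5, 0), (2.9, 0), (3.2, 0), (6.8, 1), (7.1, 1), (8.5, 1)]

def nearest_neighbor(x):
    """Flexible model: copy the label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def fit_threshold(points):
    """Simple model: choose the cut t with fewest training errors for x >= t."""
    xs = sorted(x for x, _ in points)
    cuts = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(cuts, key=lambda t: sum((x >= t) != bool(y) for x, y in points))

def accuracy(predict, points):
    """Fraction of points whose predicted label matches the given label."""
    return sum(predict(x) == y for x, y in points) / len(points)

cut = fit_threshold(train)

def threshold_rule(x):
    return int(x >= cut)

print("1-NN      train/test:", accuracy(nearest_neighbor, train),
      accuracy(nearest_neighbor, test))
print("threshold train/test:", accuracy(threshold_rule, train),
      accuracy(threshold_rule, test))
```

The nearest-neighbor model is perfect on the training data but repeats the noisy labels on nearby test points, whereas the threshold rule makes two training errors and generalizes better on this data.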
Difference between KDD and Data Mining
patterns and
relationships in data.
Basis of Comparison | KDD | Data Mining
Objective | To find useful knowledge from data. | To extract useful information from data.
Focus | Focus is on the discovery of useful knowledge, rather than simply finding patterns in data. | Data mining focus is on the discovery of patterns or relationships in data.
Role of domain expertise | Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. | Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.
Data Mining Techniques