Steps in The Data Mining Process
The data mining process is divided into two parts: Data Preprocessing and Data Mining.
Data Preprocessing involves data cleaning, data integration, data reduction, and data
transformation. The Data Mining part covers data mining itself, pattern evaluation, and
knowledge representation.
Why do we preprocess the data?
Many factors determine the usefulness of data, such as accuracy, completeness,
consistency, and timeliness. Data is said to have quality if it satisfies its intended purpose.
Preprocessing is therefore crucial in the data mining process. The major steps involved in data
preprocessing are explained below.
#1) Data Cleaning
Data cleaning is the first step in data mining. It is important because dirty data, if used
directly in mining, can confuse the mining procedures and produce inaccurate results.
This step involves the removal of noisy or incomplete data from the collection.
Many methods that clean data automatically are available, but they are not robust.
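As a minimal sketch of removing incomplete data, the snippet below drops every record that is
missing a required field. The record layout, the field names and the helper names
(is_complete, drop_incomplete) are assumptions made for illustration; the text does not
prescribe a particular cleaning method.

```python
def is_complete(record, required_fields):
    """Return True when every required field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in required_fields)


def drop_incomplete(records, required_fields):
    """Keep only the records that have a value for every required field."""
    return [r for r in records if is_complete(r, required_fields)]


# Hypothetical raw data: records 2 and 3 are incomplete.
raw = [
    {"id": 1, "age": 34, "city": "Pune"},
    {"id": 2, "age": None, "city": "Delhi"},
    {"id": 3, "age": 29, "city": ""},
    {"id": 4, "age": 41, "city": "Chennai"},
]

clean = drop_incomplete(raw, ["id", "age", "city"])
print([r["id"] for r in clean])
```

Dropping records is the simplest option; filling in a missing value instead (for example with
a default) would be an alternative when too much data would otherwise be lost.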
(ii) Remove The Noisy Data: Noisy data is data that contains random error.
Binning: Binning methods sort the values and distribute them into buckets, or bins. Smoothing
is then performed by consulting the neighboring values within each bin.
Smoothing by bin means: each value in a bin is replaced by the mean of the bin.
Smoothing by bin medians: each value in a bin is replaced by the bin median.
Smoothing by bin boundaries: the minimum and maximum values in a bin are the bin boundaries,
and each value in the bin is replaced by the closest boundary value.
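The three smoothing variants can be sketched as follows. This is an illustration under stated
assumptions: the bins are equal-frequency bins of a fixed size, ties in boundary smoothing go
to the lower boundary, and the function names and sample prices are made up for the example.

```python
from statistics import mean, median


def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_mean(bins):
    # Every value in a bin is replaced by that bin's mean.
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_median(bins):
    # Every value in a bin is replaced by that bin's median.
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    # Each value is replaced by the closer of the bin's minimum and maximum.
    # Assumption: a value exactly in the middle goes to the lower boundary.
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)
print(smooth_by_mean(bins))
print(smooth_by_median(bins))
print(smooth_by_boundaries(bins))
```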
#2) Data Integration
When multiple heterogeneous data sources such as databases, data cubes or files are
combined for analysis, the process is called data integration. This can help improve the
accuracy and speed of the data mining process.
Data integration can be performed using data migration tools such as Oracle Data Service
Integrator and Microsoft SQL.
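To make the idea concrete without relying on any particular tool, here is a small sketch that
combines two heterogeneous sources: an in-memory list standing in for a database table and a
CSV text standing in for a file. The field names, the sample data and the integrate function
are all hypothetical.

```python
import csv
import io

# Source 1: rows as they might come from a database table (hypothetical data).
customers = [
    {"customer_id": "C1", "name": "Asha"},
    {"customer_id": "C2", "name": "Ravi"},
]

# Source 2: the contents of a CSV file (hypothetical data).
orders_csv = "customer_id,amount\nC1,120.50\nC2,80.00\nC1,35.25\nC9,10.00\n"


def integrate(customers, orders_csv):
    """Join the CSV orders to the customer rows on customer_id."""
    by_id = {c["customer_id"]: c for c in customers}
    combined = []
    for row in csv.DictReader(io.StringIO(orders_csv)):
        customer = by_id.get(row["customer_id"])
        if customer is None:
            continue  # skip orders that match no known customer
        combined.append({
            "customer_id": row["customer_id"],
            "name": customer["name"],
            "amount": float(row["amount"]),  # unify the type across sources
        })
    return combined


combined = integrate(customers, orders_csv)
for record in combined:
    print(record)
```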
#3) Data Reduction
This technique is applied to obtain the data relevant to the analysis from the collection of
data. The reduced representation is much smaller in volume while maintaining the integrity of
the original data. Data reduction is performed using methods such as Naive Bayes, decision
trees, neural networks, etc.
#4) Data Transformation
In this process, data is transformed into a form suitable for the data mining process. Data is
consolidated so that the mining process is more efficient and the patterns are easier to
understand. Data transformation involves data mapping and a code generation process.
Smoothing: Removing noise from data using clustering, regression techniques, etc.
Aggregation: Summary operations are applied to data.
Normalization: Scaling of data to fall within a smaller range.
Discretization: Raw values of numeric data are replaced by intervals. For example, age
values can be replaced by age ranges.
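Three of the strategies above (aggregation, normalization and discretization) can be sketched
as follows. Min-max scaling is assumed as the normalization method, and the age interval
boundaries, function names and sample data are illustrative choices rather than anything
fixed by the text.

```python
from collections import defaultdict


def aggregate_by_key(pairs):
    """Aggregation: sum the values that share the same key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # all values equal, nothing to scale
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def discretize_age(age):
    """Discretization: replace a raw age by an interval label (assumed ranges)."""
    if age < 18:
        return "0-17"
    if age < 40:
        return "18-39"
    if age < 65:
        return "40-64"
    return "65+"


daily_sales = [("2024-01", 100.0), ("2024-01", 150.0), ("2024-02", 90.0)]
ages = [15, 22, 37, 45, 70]

print(aggregate_by_key(daily_sales))
print(min_max_normalize(ages))
print([discretize_age(a) for a in ages])
```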
#5) Data Mining
Data mining is the process of identifying interesting patterns and knowledge in a large amount
of data. In this step, intelligent methods are applied to extract the data patterns. The data is
represented in the form of patterns, and models are structured using classification and
clustering techniques.
#6) Pattern Evaluation
This step involves identifying interesting patterns representing the knowledge based on
interestingness measures. Data summarization and visualization methods are used to make
the data understandable by the user.
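The text does not name specific interestingness measures. As one common example, the sketch
below assumes support and confidence, which are widely used to score association rules; the
transactions and function names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies, so treat its confidence as zero
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

A pattern would then count as interesting only if both values exceed thresholds chosen by
the analyst.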