3.2 Data Pre-Processing Concepts
Data Cleaning
Real-world data often contains many irrelevant and missing parts. Data cleaning is done to handle these problems. It involves the handling of missing data, noisy data, etc.
i. Missing Data
This situation arises when some values are absent from the dataset. It can be
handled in various ways. Some of them are:
Ignore the tuples: This approach is suitable only when the
dataset we have is quite large and multiple values are missing
within a tuple.
Fill the missing value: There are various ways to do this task.
You can choose to fill the missing values manually, by
attribute mean or the most probable value.
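Both strategies above can be sketched in a few lines of Python. This is a minimal illustration, assuming the data is a list of (id, age) records in which `None` marks a missing age; the record values are made up for the example.

```python
# Hypothetical records; None marks a missing Age value.
records = [(1, 25), (2, None), (3, 31), (4, None), (5, 40)]

# Option 1: ignore the tuples with missing values
# (reasonable only when few tuples are affected and the dataset is large).
complete = [r for r in records if r[1] is not None]

# Option 2: fill each missing value with the attribute mean.
known = [age for _, age in records if age is not None]
mean_age = sum(known) / len(known)
filled = [(i, age if age is None else age) if False else
          (i, mean_age if age is None else age)
          for i, age in records]
```

Filling by the most probable value would follow the same pattern, substituting the mode (or a model's prediction) for the mean.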
Data Integration
Data integration is the process of combining data from multiple sources into a
single dataset. It is one of the main components of data management. Several
problems must be considered during data integration.
Schema integration: Integrates metadata (a set of data that
describes other data) from different sources.
Entity identification problem: Identifying the same entity across
multiple databases. For example, the system or the user should
know that student_id in one database and student_name in
another database belong to the same entity.
Detecting and resolving data value conflicts: Values taken
from different databases may differ when merging, i.e. the
attribute values in one database may differ from those in
another. For example, the date format may differ, such as
“MM/DD/YYYY” versus “DD/MM/YYYY”.
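The date-format conflict above can be resolved by normalizing both sources to a common representation before merging. The sketch below uses Python's standard `datetime` module; the record contents and field names are hypothetical.

```python
from datetime import datetime

# Hypothetical records for the same student from two databases that
# store the enrollment date in different formats.
db_a = {"student_id": 101, "enrolled": "09/30/2021"}       # MM/DD/YYYY
db_b = {"student_name": "Asha", "enrolled": "30/09/2021"}  # DD/MM/YYYY

def to_iso(value, fmt):
    """Normalize a date string to ISO 8601 so the sources agree."""
    return datetime.strptime(value, fmt).date().isoformat()

# Resolve the value conflict, then merge the records into one entity.
merged = {**db_a, **db_b,
          "enrolled": to_iso(db_a["enrolled"], "%m/%d/%Y")}
```

After normalization both sources yield the same date string, so the merged record is consistent.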
Data Transformation
This step transforms the data into forms suitable for the mining process. It
involves the following ways:
Normalization: It is done in order to scale the data values into a specified
range (-1.0 to 1.0 or 0.0 to 1.0).
Attribute Selection: In this strategy, new attributes are constructed
from the given set of attributes to help the mining process.
Discretization: This is done to replace the raw values of a numeric
attribute by interval levels or conceptual levels.
Concept Hierarchy Generation: Here attributes are converted from a
lower level to a higher level in the hierarchy. For example, the attribute
“city” can be converted to “country”.
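Two of the transformation steps above, normalization and discretization, can be sketched in plain Python. The value list and the interval boundaries (`low`/`medium`/`high`) are illustrative assumptions, not taken from the source.

```python
# Hypothetical raw values of a numeric attribute.
values = [10, 20, 30, 40, 50]

# Min-max normalization: scale every value into the range 0.0 to 1.0.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

def discretize(v):
    # Discretization: replace a raw value with an interval label.
    # The cut points 25 and 45 are arbitrary for this sketch.
    if v < 25:
        return "low"
    elif v < 45:
        return "medium"
    return "high"

levels = [discretize(v) for v in values]
```

Normalization to (-1.0, 1.0) would use the same formula followed by `2 * x - 1`.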
Data Reduction
Data mining is a technique used to handle huge amounts of data, and analysis
becomes harder when working with such volumes. Data reduction techniques
address this: they aim to increase storage efficiency and reduce data storage
and analysis costs.
The various steps to data reduction are:
Data Cube aggregation: Aggregation operation is applied to data for
the construction of the data cube.
Attribute Subset Selection: Only the highly relevant attributes should be
used; the rest can be discarded. For performing attribute selection, one
can use the significance level and the p-value of the attribute: an
attribute whose p-value is greater than the significance level can be
discarded.
Numerosity Reduction: This enables storing a model of the data instead
of the whole data, for example regression models.
Dimensionality Reduction: This reduces the size of the data by encoding
mechanisms. It can be lossy or lossless: if the original data can be
retrieved after reconstruction from the compressed data, the reduction is
called lossless; otherwise it is called lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA (Principal
Component Analysis).
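Numerosity reduction via a regression model can be sketched concretely: instead of storing every data point, we keep only the two parameters of a fitted line. The sample points below are invented for the illustration; the fit uses the standard least-squares formulas.

```python
# Hypothetical (x, y) observations to be summarized.
points = [(1, 2.1), (2, 4.0), (3, 6.2), (4, 7.9), (5, 10.1)]

# Ordinary least-squares fit of y = a*x + b.
n = len(points)
sx = sum(x for x, _ in points)
sy = sum(y for _, y in points)
sxx = sum(x * x for x, _ in points)
sxy = sum(x * y for x, y in points)

a = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
b = (sy - a * sx) / n                          # intercept

# The dataset is now represented by just the pair (a, b):
# any y can be approximated as a * x + b.
```

This is the sense in which regression models "store the model of data instead of the whole data": five points collapse to two numbers, at the cost of some approximation error.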
3.3 Data discretization and Concept Hierarchy Generation
Data Discretization
Data discretization divides the range of a continuous attribute into intervals,
replacing the raw numeric values with interval labels.
Example:
Suppose we have an attribute Age with the given values
Age: 1, 2, 6, 9, 11, 15, 17, 18, 19, 31, 35, 45, 58, 59, 61, 65, 71, 75
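The Age values above can be discretized into conceptual levels. The bin boundaries and labels (child/young/adult/senior) in this sketch are illustrative assumptions, not taken from the source.

```python
# The Age values from the example.
ages = [1, 2, 6, 9, 11, 15, 17, 18, 19, 31, 35, 45, 58, 59, 61, 65, 71, 75]

def age_group(age):
    # Assumed cut points for the conceptual levels.
    if age <= 10:
        return "child"
    elif age <= 19:
        return "young"
    elif age <= 59:
        return "adult"
    return "senior"

groups = [age_group(a) for a in ages]
```

Each raw age is thus replaced by its interval label, shrinking the attribute's domain from many distinct numbers to four conceptual levels.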
Concept Hierarchy Generation
Example:
A particular city can map with the belonging country. For example, New Delhi
can be mapped to India, and India can be mapped to Asia.
Top-Down Mapping: Top-down mapping starts at the top of the
hierarchy with general information and works down to the specialized
information at the bottom.
Bottom-Up Mapping: Bottom-up mapping starts at the bottom with
specialized information and works up to the generalized information
at the top.
3.4 DMQL
The Data Mining Query Language (DMQL) is based on the Structured Query
Language (SQL). It can be designed to support ad hoc and interactive data
mining, and it provides commands for specifying primitives. DMQL works with
databases and data warehouses alike and can be used to define data mining
tasks; in particular, we examine how to define data warehouses and data marts
in DMQL.
DMQL was proposed by Han, Fu, Wang, et al. for the DBMiner data mining
system.
Syntax of DMQL:
use database database_name
OR
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list
a) Characterization
mine characteristics [as pattern_name]
analyze {measure(s)}
b) Discrimination
mine comparison [as pattern_name]
for {target_class} where {target_condition}
{versus {contrast_class_i}
where {contrast_condition_i}}
analyze {measure(s)}
c) Association
mine associations [as pattern_name]
{matching {metapattern}}
d) Classification
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension