DM and DW Notes-Module2
Module-2
Data Warehouse Implementation and Data Mining
Figure 2.1: Lattice of cuboids making up a 3D data cube. Each cuboid represents a different group_by.
The base cuboid can return the total sales for any combination of the three dimensions. The apex
cuboid contains the total sum of all sales; its group_by is empty. In the figure, starting at the apex
cuboid and exploring downward in the lattice is called drilling down within the data cube, while
starting at the base cuboid and exploring upward in the lattice is called rolling up within the data
cube. The cube operator on ‘n’ dimensions is equivalent to a collection of group_by statements,
one for each subset of the dimensions.
An SQL-like statement to define this data cube is
Define cube sales_cube[city, item, year]: sum(sales_in_dollars)
A cube with ‘n’ dimensions contains 2^n cuboids, so this cube has 2^3 = 8 cuboids. The statement that instructs the system to compute all of them is
Compute cube sales_cube
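As an illustration (not part of the notes' cube syntax), the sketch below computes all 2^3 = 8 group-bys of the sales cube with pandas; the sample rows and values are made up, and only the dimension and measure names follow the definition above.

```python
# A minimal sketch: computing every group-by (cuboid) of a 3D sales cube
# with pandas. The data values are hypothetical.
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "city":  ["Vancouver", "Vancouver", "Toronto"],
    "item":  ["phone", "tv", "phone"],
    "year":  [2023, 2023, 2024],
    "sales_in_dollars": [1200.0, 800.0, 950.0],
})

dims = ["city", "item", "year"]
cuboids = {}
for k in range(len(dims) + 1):
    for group in combinations(dims, k):          # every subset of dimensions
        if group:                                 # an ordinary group-by cuboid
            cuboids[group] = sales.groupby(list(group))["sales_in_dollars"].sum()
        else:                                     # apex cuboid: empty group-by
            cuboids[group] = sales["sales_in_dollars"].sum()

print(len(cuboids))   # 2**3 = 8 cuboids; ("city", "item", "year") is the base cuboid
```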
Online analytical processing may need to access different cuboids for different queries, so it is a
good idea to compute all, or at least some, of the cuboids in a data cube in advance.
Precomputation leads to fast response time and avoids redundant computation. The major
challenge of precomputation is that it requires a large amount of storage space; this problem is
referred to as the curse of dimensionality.
For an ‘n’-dimensional data cube in which each dimension i has Li levels in its concept hierarchy, the total number of cuboids that can be generated is
Total number of cuboids = (L1 + 1) × (L2 + 1) × ... × (Ln + 1)
where 1 is added to each Li to account for the virtual top level ‘all’.
2. Full Materialization:
Precompute all of the cuboids. The resulting lattice of computed cuboids is referred to as the
full cube.
3. Partial Materialization:
Selectively compute a proper subset of the whole set of possible cuboids.
Figure 2.3: Linkages between a sales fact table and the location and item dimension tables
In Figure 2.3, the “Main street” value in the location dimension table joins with tuples
T57, T238, and T884 of the sales fact table.
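A small pandas sketch of the fact/dimension join behind Figure 2.3 is given below; the key and column names (location_key, street, TID) are hypothetical and chosen only to mirror the "Main street" example in the text.

```python
# A sketch of joining a location dimension table with a sales fact table.
# Table contents and column names are made up for illustration.
import pandas as pd

location = pd.DataFrame({
    "location_key": [1, 2],
    "street": ["Main street", "Lake avenue"],
})

sales_fact = pd.DataFrame({
    "TID": ["T57", "T238", "T884", "T901"],
    "location_key": [1, 1, 1, 2],
    "item_key": [10, 20, 10, 30],
    "sales_in_dollars": [250.0, 175.0, 310.0, 90.0],
})

# The dimension value "Main street" links to tuples T57, T238 and T884.
joined = sales_fact.merge(location, on="location_key")
print(joined.loc[joined["street"] == "Main street", "TID"].tolist())
# ['T57', 'T238', 'T884']
```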
----------------------------------------------------------------------------------------------------------------
The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables)
and may reside in a centralized data repository or be distributed across multiple sites.
The purpose of data preprocessing is to transform the raw input data into an appropriate format
for subsequent analysis. Data preprocessing is the most laborious and time-consuming step in the
overall knowledge discovery process. The steps involved in data preprocessing are:
1. Fusing data from multiple sources
2. Cleaning data to remove noise and duplicate observations
3. Selecting records and features that are relevant to the data mining task
Postprocessing ensures that only valid and useful results are incorporated into the decision
support system. An example of postprocessing is visualization, which allows analysts to explore
the data and the data mining results from a variety of viewpoints.
1. Scalability
Data sets may be gigabytes, terabytes, or even petabytes in size. To handle such volumes, data
mining algorithms must be scalable. Scalable algorithms often use special search strategies to
handle exponential search problems. Scalability can also be improved by using sampling or by
developing parallel and distributed algorithms.
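As one concrete illustration of a scalable technique (not prescribed by the notes), the sketch below uses reservoir sampling to keep a fixed-size random sample while streaming once over data too large to hold in memory; the data source here is a simple stand-in iterator.

```python
# A minimal sketch of reservoir sampling: a fixed-size uniform random sample
# is maintained in one pass over an arbitrarily long stream of items.
import random

def reservoir_sample(stream, k, seed=0):
    """Return k items chosen uniformly at random from the stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)         # keep the new item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```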
2. High Dimensionality
Nowadays, data sets often contain hundreds or thousands of attributes. In bioinformatics, for
example, progress in microarray technology has produced gene expression data involving
thousands of features.
3. Heterogeneous and Complex data
Traditional data analysis methods deal with data sets that contain attributes of the same type.
Data mining techniques, in contrast, can handle data sets that contain heterogeneous attributes.
Examples:
Web pages containing semi-structured text and hyperlinks.
DNA data with sequential and three-dimensional structure.
4. Data Ownership and Distribution
Sometimes the data needed for an analysis is not stored in one location or owned by one
organization; instead, it is geographically distributed across multiple sites. This requires
the development of distributed data mining techniques. The key challenges faced by distributed
data mining algorithms include:
1. How to reduce the amount of communication needed to perform the distributed
computation
2. How to effectively consolidate the data mining results obtained from multiple sources
3. How to address data security issues
5. Non-traditional Analysis
In traditional analysis, a hypothesis is proposed, an experiment is designed to gather the data, and
then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely
labor intensive. Data mining, in contrast, often requires non-traditional analysis, because the
data sets analyzed are typically not the result of carefully designed experiments and often
represent opportunistic samples of the data rather than random samples.
1. Predictive tasks
The objective of these tasks is to predict the value of a particular attribute based on the
values of other attributes.
The attribute to be predicted is commonly known as the target or dependent variable,
while the attributes used for making the prediction are known as the explanatory or
independent variables.
2. Descriptive tasks
The objective of descriptive tasks is to derive patterns (correlations, trends, clusters,
trajectories, and anomalies) that summarize the underlying relationships in data.
Descriptive data mining tasks are exploratory in nature and frequently require
postprocessing techniques to validate and explain the results.
1. Predictive Modeling
Predictive modeling refers to the task of building a model for the target variable as a
function of the explanatory variables. There are two types of predictive modeling tasks:
a. Classification, which is used for discrete target variables.
b. Regression, which is used for continuous target variables.
Example1: predicting whether a Web user will make a purchase at an online bookstore is a
classification task because the target variable is binary-valued. On the other hand,
forecasting the future price of a stock is a regression task because price is a continuous-
valued attribute.
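A minimal scikit-learn sketch of the two task types is shown below; the feature and target values are toy numbers invented for the illustration.

```python
# A small sketch contrasting classification (discrete target) and regression
# (continuous target) on made-up data.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Classification: binary target, e.g. "will the user make a purchase?"
y_class = np.array([0, 0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[4.5]]))        # predicted class label

# Regression: continuous target, e.g. the future price of a stock
y_reg = np.array([10.1, 19.8, 30.2, 39.9, 50.3])
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[4.5]]))        # predicted numeric value
```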
Figure 2.6: Petal width versus petal length for 150 iris flowers
2. Association analysis
Association analysis is used to discover patterns that describe associated features in the
data.
The discovered patterns are typically represented in the form of implication rules or
feature subsets. Because of the exponential size of its search space, the goal of association
analysis is to extract the most interesting patterns in an efficient manner.
Example1:
Finding groups of genes that have related functionality
Identifying the Web pages that are accessed together
Example2: Market Basket Analysis
The transactions shown in Table 2.1 illustrate point-of-sale data collected at the checkout
counters of a grocery store. Association analysis can be applied to find items that are
frequently bought together by customers. For example, we may discover the rule {Diapers} ->
{Milk}, which suggests that customers who buy diapers also tend to buy milk. This type of
rule can be used to identify potential cross-selling opportunities among related items.
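The short sketch below shows how the support and confidence of the rule {Diapers} -> {Milk} could be computed directly; the five transactions are hypothetical and are not the contents of Table 2.1.

```python
# A minimal sketch of support and confidence for {Diapers} -> {Milk},
# computed over a made-up list of transactions.
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk"},
    {"Bread", "Butter", "Cookies", "Diapers", "Milk"},
    {"Beer", "Diapers", "Milk"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"Diapers", "Milk"} <= t)
diapers = sum(1 for t in transactions if "Diapers" in t)

support = both / n              # fraction of all transactions containing both items
confidence = both / diapers     # fraction of diaper-buyers who also buy milk
print(f"support={support:.2f}, confidence={confidence:.2f}")
```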
Types of Data
A data set can be viewed as a collection of data objects. A data object is also called a
record, point, vector, pattern, event, case, sample, observation, or entity.
Data objects are described by a number of attributes that capture the basic characteristics
of an object, such as the mass of a physical object or the time at which an event occurred.
An attribute is also called a variable, characteristic, field, feature, or dimension.
Example: (Student Information)
Frequently, a data set is a file in which the objects are records (or rows) in the file and each
attribute is a field (or column) in the file.
i) Dimensionality
The dimensionality of a data set is the number of attributes that the objects in the data set
possess.
Data with a small number of dimensions tends to be qualitatively different from moderate-
or high-dimensional data.
The difficulties associated with analyzing high-dimensional data are sometimes referred to
as the curse of dimensionality.
ii) Sparsity
A data set is sparse if, for most objects, most attribute values are 0; often fewer than 1%
of the entries are non-zero.
In practical terms, sparsity is an advantage because usually only the non-zero values need
to be stored and manipulated.
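A brief sketch of this storage advantage, using scipy's compressed sparse row (CSR) format on a made-up matrix:

```python
# Only the non-zero entries of a sparse matrix need to be stored explicitly.
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 2, 0, 0, 0, 0],
])

sparse = csr_matrix(dense)
print(dense.size)      # 18 values held in the dense representation
print(sparse.nnz)      # only 3 non-zero values stored explicitly
```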
iii) Resolution
It is possible to obtain data at different levels of resolution, and often the properties of the
data are different at different resolutions.
Example: The surface of the Earth seems very uneven at a resolution of a few meters, but
is relatively smooth at a resolution of tens of kilometers.
The patterns in the data also depend on the level of resolution.
iv) Record Data
The data set is a collection of records (data objects), each of which consists of a fixed set
of data fields (attributes).
In the most basic form of record data, there is no explicit relationship among records or data
fields, and every record (object) has the same set of attributes.
Record data is usually stored either in flat files or in relational databases.
v) Transaction or Market Basket Data
Transaction data is a special type of record data, where each record (transaction) involves
a set of items.
Example: Consider a grocery store. The set of products purchased by a customer during
one shopping trip constitutes a transaction, while the individual products that were
purchased are the items. This type of data is called market basket data.
vi) The Data Matrix:
If the data objects in a collection of data have the same fixed set of numeric attributes,
then the data objects can be thought of as points (vectors) in a multidimensional space,
where each dimension represents a distinct attribute describing the object.
A set of such data objects can be interpreted as an ‘m’ by ‘n’ matrix, where there are m rows,
one for each object, and n columns, one for each attribute.
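A tiny NumPy sketch of this view, with made-up values for 4 objects described by 3 numeric attributes:

```python
# The data matrix view: m objects as rows, n numeric attributes as columns.
import numpy as np

data_matrix = np.array([
    [5.1, 3.5, 0.2],
    [4.9, 3.0, 0.2],
    [6.3, 3.3, 1.9],
    [5.8, 2.7, 1.2],
])

m, n = data_matrix.shape
print(m, n)                 # 4 objects, 3 attributes
print(data_matrix[0])       # one object as a point (vector) in 3-dimensional space
```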
vii) The sparse Data Matrix:
A sparse data matrix is a special case of a data matrix in which the attributes are of the
same type and are asymmetric; i.e., only non-zero values are important.
Data Quality
Data quality problems can be grouped into two broad categories:
i. Measurement and Data Collection Issues
ii. Issues Related to Applications
Noise is the random component of a measurement error. Techniques from signal or image
processing can be used to reduce noise and thus help to discover patterns (signals) that
might be "lost in the noise." Because completely eliminating noise is usually difficult,
data mining focuses on robust algorithms that produce acceptable results even when
noise is present.
Deterministic distortions of the data are referred to as artifacts.
e. Missing Values:
It is not unusual for an object to be missing one or more attribute values.
In some cases the information was simply not collected; for example, some people decline to give
their age or weight.
There are several strategies for dealing with missing data
1. Eliminate Data Objects or Attributes
2. Estimate Missing Values
3. Ignore the Missing Value during Analysis
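A minimal pandas sketch of the first two strategies, on a made-up data set with hypothetical 'age' and 'weight' attributes:

```python
# Handling missing values: eliminate objects with missing entries, or
# estimate (impute) the missing entries. The data set is hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 41, 33],
    "weight": [70.0, 82.5, np.nan, 64.0],
})

dropped = df.dropna()                               # 1. eliminate objects with missing values
estimated = df.fillna(df.mean(numeric_only=True))   # 2. estimate missing values with the column mean
print(dropped)
print(estimated)
```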
f. Inconsistent Values:
A data set may also contain inconsistent values, for example an address record whose zip code
does not belong to the listed city.
g. Duplicate Data:
A data set may include data objects that are duplicates, or almost duplicates, of one another;
such records usually need to be detected and merged or removed.
The data must also contain the information relevant to the application. If information about the
age and gender of the driver is omitted from the data used to build a model (for example, one
that predicts accident rates), then the model will have limited accuracy unless this information
is indirectly available through other attributes.
c. Knowledge about the Data
Normally, data sets are accompanied by documentation that describes different aspects of
the data. The quality of this documentation strongly affects the subsequent analysis; if the
documentation is poor, we cannot do a proper analysis.
For example, if missing values of a particular field are indicated by the code "999" but the
documentation does not say so, then our analysis of the data may be faulty.
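A short pandas sketch of putting such documentation to use: if the notes for a data set say that "999" marks a missing value, the loader can be told so up front (the inline data here is made up).

```python
# Treating a documented missing-value code ("999") as missing when loading data.
import io
import pandas as pd

raw = io.StringIO("name,age\nAlice,34\nBob,999\nCarol,29\n")

df = pd.read_csv(raw, na_values=["999"])   # "999" is read as NaN, not as a real age
print(df["age"].mean())                    # 31.5 -- the coded value is ignored
```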
Data Preprocessing
In this topic, we discuss the issue of which preprocessing steps should be applied to make
the data more suitable for data mining. The data preprocessing techniques are shown
below:
1. Aggregation
2. Sampling
3. Dimensionality reduction
4. Feature subset selection
5. Feature creation
6. Discretization and binarization
7. Variable transformation
1. Aggregation:
Sometimes "less is more," and this is the idea behind aggregation: combining two or more
objects into a single object.
Example: Consider the following data set,
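Independently of the data set referred to above (which is not reproduced here), a minimal pandas sketch of aggregation on made-up store sales might look like this:

```python
# Aggregation: many daily sales records per store are combined into one
# record per store by summing over the date field. Values are hypothetical.
import pandas as pd

daily_sales = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S2"],
    "date":  ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02", "2024-01-03"],
    "sales": [120.0, 95.0, 210.0, 180.0, 160.0],
})

per_store = daily_sales.groupby("store", as_index=False)["sales"].sum()
print(per_store)
```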
2. Sampling:
Sampling is a commonly used approach for selecting a subset of the data objects to be
analyzed.
Sampling is used to reduce cost and time, because it is usually too expensive or time consuming to
process all the data.
A sample is representative if it has approximately the same property (of interest) as the
original set of data.
Simple Random Sampling - In simple random sampling, there is an equal probability of
selecting any particular item. There are two types of random sampling:
o Sampling without replacement: