What Motivated Data Mining? Why Is It Important?: The Evolution of Database Technology
INTRODUCTION
The major reason that data mining has attracted a great deal of attention in the information
industry in recent years is the wide availability of huge amounts of data and the
imminent need for turning such data into useful information and knowledge. The information
and knowledge gained can be used in applications ranging from business management,
production control, and market analysis to engineering design and science exploration.
WHAT IS DATA MINING?
Data mining refers to extracting or mining knowledge from large amounts of data. There are
many other terms related to data mining, such as knowledge mining, knowledge extraction,
data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a
synonym for another popularly used term, "Knowledge Discovery in Databases", or KDD. Viewed as a
process, knowledge discovery consists of an iterative sequence of steps, including:
Data selection: where data relevant to the analysis task are retrieved from the database.
Data transformation: where data are transformed or consolidated into forms appropriate
for mining by performing summary or aggregation operations.
Data mining: an essential process where intelligent methods are applied in order
to extract data patterns.
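These steps can be sketched in miniature in Python. The records, city name, and threshold below are invented purely for illustration:

from collections import Counter

# Hypothetical raw records: (customer_id, city, purchase_amount).
database = [
    (1, "Delhi", 120.0), (2, "Mumbai", 80.0),
    (3, "Delhi", 200.0), (4, "Delhi", 40.0),
]

# Data selection: retrieve only the records relevant to the analysis task.
selected = [rec for rec in database if rec[1] == "Delhi"]

# Data transformation: consolidate by aggregation (total spend per city).
totals = Counter()
for _, city, amount in selected:
    totals[city] += amount

# Data mining: apply a (deliberately trivial) method to extract a pattern.
pattern = {city: total for city, total in totals.items() if total > 100}
print(pattern)  # {'Delhi': 360.0}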
A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and which usually resides at a single site. Data warehouses are constructed
via a process of data cleansing, data transformation, data integration, data loading, and
periodic data refreshing. The figure shows the basic architecture of a data warehouse.
In order to facilitate decision making, the data in a data warehouse are organized around major
subjects, such as customer, item, supplier, and activity. The data are stored to provide information
from a historical perspective and are typically summarized.
The data cube structure that stores the primitive or lowest level of information is called a base
cuboid. Its corresponding higher level multidimensional (cube) structures are called (non-base)
cuboids. A base cuboid together with all of its corresponding higher level cuboids form a data
cube. By providing multidimensional data views and the pre-computation of summarized data,
data warehouse systems are well suited for On-Line Analytical Processing, or OLAP. OLAP
operations make use of background knowledge regarding the domain of the data being studied in
order to allow the presentation of data at different levels of abstraction. Such operations
accommodate different user viewpoints. Examples of OLAP operations include drill-down and roll-
up, which allow the user to view the data at differing degrees of summarization, as illustrated in
the figure below.
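The effect of roll-up and drill-down can also be sketched with plain Python aggregation over a tiny fact table; the dimensions and figures below are invented for illustration:

from collections import defaultdict

# Hypothetical fact table rows: (city, quarter, month, sales).
facts = [
    ("Delhi", "Q1", "Jan", 100), ("Delhi", "Q1", "Feb", 150),
    ("Delhi", "Q2", "Apr", 120), ("Mumbai", "Q1", "Jan", 90),
]

def aggregate(keys):
    # Summarize sales along the chosen dimension attributes.
    cube = defaultdict(int)
    for city, quarter, month, sales in facts:
        row = {"city": city, "quarter": quarter, "month": month}
        cube[tuple(row[k] for k in keys)] += sales
    return dict(cube)

# Drill-down: view the data at the finer, month level of abstraction.
print(aggregate(["city", "quarter", "month"]))
# Roll-up: climb the time hierarchy from month up to quarter.
print(aggregate(["city", "quarter"]))

Rolling up amounts to dropping the finest attribute from the grouping key; drilling down restores it.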
(iii) Transactional databases
In general, a transactional database consists of a flat file where each record represents a
transaction. A transaction typically includes a unique transaction identity number (trans ID)
and a list of the items making up the transaction (such as items purchased in a store),
as shown below:
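A minimal sketch of such a flat file in Python follows; the transaction IDs and item names are made up:

# Each record: a unique trans ID and the list of items in that transaction.
transactions = {
    "T100": ["bread", "milk", "eggs"],
    "T200": ["bread", "jelly"],
    "T300": ["milk"],
}
for trans_id, items in transactions.items():
    print(trans_id, items)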
Spatial databases contain spatial-related data, which may be represented in the
form of raster or vector data. Raster data consists of n-dimensional bit maps or pixel
maps, and vector data are represented by lines, points, polygons, or other geometric
primitives. Examples of spatial databases include geographic (map) databases, VLSI
chip designs, and medical and satellite image databases.
Time-series databases contain time-related data such as stock market data or
logged activities. These databases usually have a continuous flow of new
data coming in, which sometimes creates the need for challenging real-time
analysis. Data mining in such databases commonly includes the
study of trends and correlations between the evolution of different variables,
as well as the prediction of trends and movements of the variables over time.
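As a rough sketch of trend analysis, the moving average below smooths a fabricated price series and naively extends its last trend step as a prediction:

# Fabricated daily closing prices for one stock.
prices = [10.0, 10.4, 10.2, 10.9, 11.3, 11.1, 11.8]

def moving_average(series, window):
    # Smooth the series to expose its underlying trend.
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

trend = moving_average(prices, window=3)
# A naive prediction of the next value: extend the last observed trend step.
prediction = trend[-1] + (trend[-1] - trend[-2])
print(trend, prediction)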
Text database is a database that contains text documents or other word descriptions in
the form of long sentences or paragraphs, such as product specifications, error or bug
reports, warning messages, summary reports, notes, or other documents.
Multimedia database stores images, audio, and video data, and is used in
applications such as picture content-based retrieval, voice-mail systems, video-
on-demand systems, the World Wide Web, and speech-based user interfaces.
Data mining functionalities are used to specify the kind of patterns to be found in
data mining tasks. In general, data mining tasks can be classified into two categories:
Descriptive
Predictive
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Data can be associated with classes or concepts. A class/concept description describes a
given set of data in a concise and summarative manner, presenting interesting general
properties of the data. These descriptions can be derived via
Data Characterization
It is a summarization of the general characteristics or features of a target class of data.
Example
The general characteristics of students with a high GPA may be summarized into a
characteristic description of the target class.
Data Discrimination
It is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes.
Example
The general features of students with high GPAs may be compared with the general features of
students with low GPAs. The resulting description could be a general comparative profile of the
students, such as: 75% of the students with high GPAs are fourth-year computing science students,
while 65% of the students with low GPAs are not.
The output of data characterization can be presented in various forms. Examples include
pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables,
including crosstabs. The resulting descriptions can also be presented as generalized
relations, or in rule form called characteristic rules.
Association Analysis
It is the discovery of association rules showing attribute-value conditions that occur frequently
together in a given set of data. For example, a data mining system may find association rules like

major(X, "computing science") => owns(X, "personal computer") [support = 12%, confidence = 98%]

where X is a variable representing a student. The rule indicates that of the students under
study, 12% (support) major in computing science and own a personal computer. There is a 98%
probability (confidence, or certainty) that a student in this group owns a personal computer.
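The two measures can be computed by simple counting. The student records below are invented and do not reproduce the 12%/98% figures above:

# Invented student records: (major, owns_personal_computer).
students = [
    ("computing science", True), ("computing science", True),
    ("computing science", False), ("biology", True),
    ("biology", False),
]

n = len(students)
both = sum(1 for major, owns in students
           if major == "computing science" and owns)
antecedent = sum(1 for major, owns in students
                 if major == "computing science")

support = both / n              # fraction satisfying both sides of the rule
confidence = both / antecedent  # P(owns | major)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")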
Example
A grocery store retailer wants to decide whether to put bread on sale. To help determine the
impact of this decision, the retailer generates association rules that show what other
products are frequently purchased with bread. He finds that 60% of the time that bread is
sold, pretzels are also sold, and that 70% of the time jelly is also sold. Based on these facts, he
tries to capitalize on the association between bread, pretzels, and jelly by placing
some pretzels and jelly at the end of the aisle where the bread is placed. In addition, he
decides not to place either of these items on sale at the same time.
Correlations
Correlation analysis is a technique used to measure the association between two variables.
A correlation coefficient (r) is a statistic used for measuring the strength of a supposed
linear association between two variables. Correlations range from -1.0 to +1.0 in value.
A correlation coefficient of 1.0 indicates a perfect positive relationship in which high values of
one variable are related perfectly to high values in the other variable, and conversely, low
values on one variable are perfectly related to low values on the other variable.
A correlation coefficient of 0.0 indicates no relationship between the two variables. That is, one
cannot use the scores on one variable to tell anything about the scores on the second variable.
A correlation coefficient of -1.0 indicates a perfect negative relationship in which high values of
one variable are related perfectly to low values in the other variable, and conversely, low
values on one variable are perfectly related to high values on the other variable.
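The coefficient can be computed directly from its definition, as the covariance of the two variables divided by the product of their standard deviations. The score lists below are illustrative only:

import math

def pearson_r(xs, ys):
    # Pearson correlation coefficient of two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0: perfect positive
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0: perfect negative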
Classification
Classification can be defined as the process of finding a model (or function) that describes
and distinguishes data classes or concepts, for the purpose of being able to use the model
to predict the class of objects whose class label is unknown. The derived model is based
on the analysis of a set of training data (i.e., data objects whose class label is known).
It classifies data (constructs a model) based on the training set and the values
(class labels) in a classifying attribute and uses it in classifying new data.
Typical Applications
credit approval
target marketing
medical diagnosis
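As a minimal sketch in the credit approval setting (not a method prescribed above), a one-nearest-neighbour classifier assigns each new applicant the label of the most similar training object; all figures are fabricated:

# Fabricated training set: (income, debt) -> known class label.
training = [
    ((50, 10), "approve"), ((60, 5), "approve"),
    ((20, 30), "reject"), ((25, 25), "reject"),
]

def classify(obj):
    # Assign the label of the closest training object.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda pair: dist(pair[0], obj))[1]

print(classify((55, 8)))   # 'approve'
print(classify((22, 28)))  # 'reject'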
Example
An airport security screening station is used to determine if passengers are potential terrorists
or criminals. To do this, the face of each passenger is scanned and its basic pattern (distance
between eyes, size and shape of mouth, head, etc.) is identified. This pattern is compared to
entries in a database to see if it matches any patterns that are associated with known offenders.
Regression
Regression is used to predict missing or unavailable numeric data values rather than class labels.
Regression analysis is a statistical methodology that is most often used for numeric prediction.
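A least-squares straight-line fit is the simplest instance of such numeric prediction; the observations below are invented:

# Invented observations: years of experience -> salary (in thousands).
xs = [1, 2, 3, 4, 5]
ys = [30, 35, 42, 48, 55]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Least-squares slope and intercept of the line y = a*x + b.
a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
b = my - a * mx
print(f"predicted salary at 6 years: {a * 6 + b:.1f}")  # 60.9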
Prediction
Finding some missing or unavailable data values rather than class labels is referred to as prediction.
Although prediction may refer to both data value prediction and class label prediction, it is
usually confined to data value prediction and thus is distinct from classification. Prediction
also encompasses the identification of distribution trends based on the available data.
Example
Predicting flooding is a difficult problem. One approach uses monitors placed at various
points in the river. These monitors collect data relevant to flood prediction: water level,
rain amount, time, humidity, etc. The water levels at a potential flooding point in the river
can then be predicted based on the data collected by the sensors upriver from this point. The
prediction must be made with respect to the time the data were collected.
Classification differs from prediction in that the former constructs a set of models (or
functions) that describe and distinguish data classes or concepts, whereas the latter predicts
some missing or unavailable, and often numerical, data values. Their similarity is that they are
both tools for prediction: classification is used for predicting the class label of data objects,
and prediction is typically used for predicting missing numerical data values.
Clustering analyzes data objects without consulting a known class label. The
objects are clustered or grouped based on the principle of maximizing the intra-
class similarity and minimizing the interclass similarity. Each cluster that is formed
can be viewed as a class of objects.
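A stripped-down k-means loop, one common clustering method (not necessarily the one intended above), illustrates this grouping principle; the points and the choice of two clusters are arbitrary:

# Arbitrary 2-D customer points (say, income and age) and two clusters.
points = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (8, 8)]
centroids = [points[0], points[3]]  # naive initial centroids

def dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

for _ in range(10):  # a few refinement rounds
    # Assign each point to its nearest centroid.
    clusters = [[], []]
    for p in points:
        clusters[min((0, 1), key=lambda i: dist(p, centroids[i]))].append(p)
    # Recompute each centroid as the mean of its cluster.
    centroids = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                 for c in clusters]

print(clusters)  # the two groups of similar customers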
Clustering can also facilitate taxonomy formation, that is, the organization of observations
into a hierarchy of classes that group similar events together as shown below:
Example
A certain national department store chain creates special catalogs targeted to various
demographic groups based on attributes such as income, location and physical characteristics
of potential customers (age, height, weight, etc). To determine the target mailings of the various
catalogs and to assist in the creation of new, more specific catalogs, the company performs a
clustering of potential customers based on the determined attribute values. The results of the
clustering exercise are then used by management to create special catalogs and distribute them
to the correct target population based on the cluster for that catalog.
A database may contain data objects that do not comply with the general model of the data. These
data objects are outliers. In other words, data objects which do not fall within any
cluster are called outlier data objects. Noisy or exceptional data are also called
outlier data. The analysis of outlier data is referred to as outlier mining.
Example
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases
of extremely large amounts for a given account number in comparison to regular
charges incurred by the same account. Outlier values may also be detected with
respect to the location and type of purchase, or the purchase frequency.
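One simple way to flag such purchases is a z-score test against the account's regular charges; the amounts and the 2-standard-deviation threshold below are illustrative assumptions:

import statistics

# Illustrative charge history for one account, with one extreme purchase.
charges = [25, 40, 32, 28, 35, 30, 900]

mean = statistics.mean(charges)
sd = statistics.stdev(charges)
# Flag charges more than 2 standard deviations from the mean as outliers.
outliers = [c for c in charges if abs(c - mean) / sd > 2]
print(outliers)  # [900]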
Objective measures of pattern interestingness are based on statistics. These measures specify:
Support Threshold
Confidence Threshold
Support(X => Y) = P(X ∪ Y)
Confidence(X => Y) = P(Y | X)
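Given these measures, candidate rules are retained only if they clear both thresholds. The rules and statistics below are made up:

# Made-up candidate rules with their measured (support, confidence).
rules = {
    "bread => pretzels": (0.10, 0.60),
    "bread => jelly": (0.12, 0.70),
    "bread => soap": (0.01, 0.15),
}
MIN_SUPPORT, MIN_CONFIDENCE = 0.05, 0.50

interesting = [r for r, (s, c) in rules.items()
               if s >= MIN_SUPPORT and c >= MIN_CONFIDENCE]
print(interesting)  # only the rules that clear both thresholds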
There are many data mining systems available or being developed. Some are specialized
systems dedicated to a given data source or are confined to limited data mining
functionalities; others are more versatile and comprehensive. Data mining systems can be
categorized according to various criteria; among others, the classifications are the following:
(i) Classification according to the mining techniques used
Data mining systems employ and provide different techniques. This classification
categorizes data mining systems according to the data analysis approach used such
as machine learning, neural networks, genetic algorithms, statistics, visualization,
database oriented or data warehouse-oriented, etc. The classification can also take
into account the degree of user interaction involved in the data mining process such
as query-driven systems, interactive exploratory systems, or autonomous systems. A
comprehensive system would provide a wide variety of data mining techniques to fit
different situations and options, and offer different degrees of user interaction.
(ii) Classification according to the type of data mined
This classification categorizes data mining systems according to the type of data handled, such
as spatial data, multimedia data, time-series data, text data, World Wide Web data, etc.
(iii) Classification according to the data model involved
This classification categorizes data mining systems based on the data model involved, such
as relational database, transactional database, object-oriented database, or data warehouse
mining systems.
The user can specify a data mining task in the form of a data mining query. This query
is input to the system. A data mining query is defined in terms of data mining task
primitives. These primitives allow the user to communicate in an interactive manner
with the data mining system. The following are the data mining task primitives:
(i) The set of task-relevant data to be mined
This primitive specifies the data upon which mining is to be performed. It involves
specifying the database and tables or data warehouse containing the relevant data,
conditions for selecting the relevant data, the relevant attributes or dimensions for
exploration, and instructions regarding the ordering or grouping of the data retrieved.
(ii) The kind of knowledge to be mined
This primitive specifies the specific data mining function to be performed, such as
characterization, discrimination, association, classification, clustering, or evolution
analysis. As well, the user can be more specific and provide pattern templates that
all discovered patterns must match. These templates or meta patterns (also called
meta rules or meta queries), can be used to guide the discovery process.
(iii) The background knowledge to be used in the discovery process
This primitive allows users to specify knowledge they have about the domain to be
mined. Such knowledge can be used to guide the knowledge discovery process
and evaluate the patterns that are found. Of the several kinds of background
knowledge, this task primitive focuses on concept hierarchies.
(iv) The interestingness measures and thresholds for pattern evaluation
This primitive allows users to specify functions that are used to separate uninteresting
patterns from knowledge and may be used to guide the mining process, as well as to
evaluate the discovered patterns. This allows the user to confine the number of
uninteresting patterns returned by the process, as a data mining process may
generate a large number of patterns. Interestingness measures can be specified for
such pattern characteristics as simplicity, certainty, utility and novelty.
(v) Visualization of discovered patterns
This primitive refers to the form in which discovered patterns are to be displayed. In order for
data mining to be effective in conveying knowledge to users, data mining systems should be
able to display the discovered patterns in multiple forms such as rules, tables, cross tabs
(cross-tabulations), pie or bar charts, decision trees, cubes or other visual representations.
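Taken together, the five primitives form a data mining query. A hypothetical query bundled as a plain Python dictionary might look as follows; every field name and value is invented:

# Hypothetical data mining query assembled from the five task primitives.
mining_query = {
    "task_relevant_data": {
        "database": "university_db",  # invented names throughout
        "tables": ["student"],
        "attributes": ["major", "gpa", "owns_pc"],
        "condition": "status = 'undergraduate'",
    },
    "kind_of_knowledge": "association",
    "background_knowledge": {"concept_hierarchy": "city < province < country"},
    "interestingness": {"min_support": 0.05, "min_confidence": 0.70},
    "presentation": ["rules", "tables"],
}
print(mining_query["kind_of_knowledge"])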
The differences between the following architectures for the integration of a data
mining system with a database or data warehouse system are as follows.
(i) No coupling
The data mining system uses sources such as flat files to obtain the initial data set to be
mined since no database system or data warehouse system functions are implemented as
part of the process. Thus, this architecture represents a poor design choice.
(ii) Loose coupling
The data mining system is not integrated with the database or data warehouse system
beyond their use as the source of the initial data set to be mined, and possible use in
storage of the results. Thus, this architecture can take advantage of the flexibility,
efficiency and features such as indexing that the database and data warehousing systems
may provide. However, it is difficult for loose coupling to achieve high scalability and good
performance with large data sets as many such systems are memory-based.
(iii) Semi-tight coupling
Some of the data mining primitives, such as aggregation, sorting, or precomputation of
statistical functions, are efficiently implemented in the database or data warehouse system,
for use by the data mining system during mining-query processing. Also, some frequently
used intermediate mining results can be precomputed and stored in the database or data
warehouse system, thereby enhancing the performance of the data mining system.
(iv) Tight coupling
The database or data warehouse system is fully integrated as part of the data mining system and
thereby provides optimized data mining query processing. Thus, the data mining subsystem is
treated as one functional component of an information system. This is a highly desirable
architecture, as it facilitates efficient implementations of data mining functions, high system
performance, and an integrated information processing environment.
From the descriptions of the architectures provided above, it can be seen that tight coupling is
the best alternative without respect to technical or implementation issues. However, as much of
the technical infrastructure needed in a tightly coupled system is still evolving, implementation
of such a system is non-trivial. Therefore, the most popular architecture is currently semi-tight
coupling, as it compromises between loose and tight coupling.