
MODULE – I

INTRODUCTION

WHAT MOTIVATED DATA MINING? WHY IS IT IMPORTANT?

Data mining has attracted a great deal of attention in the information industry in recent years because of the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used in applications ranging from business management, production control, and market analysis to engineering design and science exploration.

The evolution of database technology

WHAT IS DATA MINING?

Data mining refers to extracting or mining knowledge from large amounts of data. Many other terms carry a similar meaning, such as knowledge mining, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery in Databases, or KDD; alternatively, data mining can be viewed as simply an essential step in the process of knowledge discovery in databases.

Knowledge discovery as a process consists of an iterative sequence of the following steps:

Data cleaning: to remove noise and inconsistent or irrelevant data.

Data integration: where multiple data sources may be combined.

Data selection: where data relevant to the analysis task are retrieved from the database.

Data transformation: where data are transformed or consolidated into forms appropriate for mining, for example by performing summary or aggregation operations.

Data mining: an essential process where intelligent methods are applied in order to extract data patterns.

Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on interestingness measures.

Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user.
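As a rough illustration only, the sequence above can be sketched as a small Python/pandas pipeline. The file names, column names, and the interestingness threshold below are hypothetical, not part of any standard procedure:

import pandas as pd

# Data integration: combine two hypothetical data sources on a shared key.
sales = pd.read_csv("sales.csv")          # hypothetical source file
customers = pd.read_csv("customers.csv")  # hypothetical source file
data = sales.merge(customers, on="customer_id")

# Data cleaning: drop records with missing values (noisy/irrelevant data).
data = data.dropna()

# Data selection: keep only the attributes relevant to the analysis task.
data = data[["region", "amount"]]

# Data transformation: consolidate via a summary (aggregation) operation.
summary = data.groupby("region")["amount"].sum()

# Data mining and pattern evaluation: flag regions whose total sales are
# unusually high (an arbitrary interestingness threshold, for illustration).
interesting = summary[summary > summary.mean() + 2 * summary.std()]

# Knowledge presentation: report the mined result to the user.
print(interesting)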

DATA MINING - ON WHAT KIND OF DATA?

(i) Relational Databases

A relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a unique key.
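As a minimal sketch of these ideas (the table, columns, and rows are invented for illustration), a relational table can be created and queried with Python's built-in sqlite3 module:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Columns are attributes; cust_id serves as the unique key.
cur.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# Each row (tuple) corresponds to one customer object.
cur.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, "Smith", 32), (2, "Jones", 45)])

for row in cur.execute("SELECT * FROM customer"):
    print(row)  # (1, 'Smith', 32) then (2, 'Jones', 45)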

(ii) Data warehouses

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleansing, data transformation, data integration, data loading, and periodic data refreshing.

(Figure: the basic architecture of a data warehouse.)

To facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The data are stored to provide information from a historical perspective and are typically summarized.

A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sales amount. The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. Either provides a multidimensional view of data and allows the precomputation and fast access of summarized data.

The data cube structure that stores the primitive or lowest level of information is called a base cuboid. Its corresponding higher-level multidimensional (cube) structures are called (non-base) cuboids. A base cuboid together with all of its corresponding higher-level cuboids forms a data cube. By providing multidimensional data views and the precomputation of summarized data, data warehouse systems are well suited for On-Line Analytical Processing, or OLAP. OLAP operations make use of background knowledge regarding the domain of the data being studied in order to allow the presentation of data at different levels of abstraction. Such operations accommodate different user viewpoints. Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at differing degrees of summarization.
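To make roll-up concrete, here is a hedged sketch in Python/pandas; the dimensions (city, quarter) and the sales measure are invented for illustration:

import pandas as pd

# A tiny fact table: two dimensions (city, quarter) and one measure (sales).
facts = pd.DataFrame({
    "city":    ["Vancouver", "Vancouver", "Toronto", "Toronto"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [400, 550, 300, 350],
})

# The base cuboid holds the lowest-level data: sales by (city, quarter).
base = facts.groupby(["city", "quarter"])["sales"].sum()

# Roll-up: climb to a higher-level cuboid by dropping the quarter dimension.
by_city = facts.groupby("city")["sales"].sum()

# Drill-down is the reverse: moving from by_city back down to base.
print(base)
print(by_city)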

(iii) Transactional databases

In general, a transactional database consists of a flat file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction (such as items purchased in a store), as shown below:
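An illustrative example (the identifiers and items are hypothetical):

trans ID | list of item IDs
T100     | I1, I3, I8, I16
T200     | I2, I8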

(iv) Other kinds of data

An object-oriented database is designed based on the object-oriented programming paradigm, where the data are a large number of objects organized into classes and class hierarchies. Each entity in the database is considered as an object. An object contains a set of variables that describe the object, a set of messages that the object can use to communicate with other objects or with the rest of the database system, and a set of methods, where each method holds the code to implement a message.
A spatial database contains spatial-related data, which may be represented in the form of raster or vector data. Raster data consist of n-dimensional bit maps or pixel maps, and vector data are represented by lines, points, polygons, or other kinds of processed primitives. Some examples of spatial databases include geographical (map) databases, VLSI chip design databases, and medical and satellite image databases.

A time-series database contains time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes calls for challenging real-time analysis. Data mining in such databases commonly includes the study of trends and correlations between the evolution of different variables, as well as the prediction of trends and movements of the variables over time.

A text database contains text documents or other word descriptions in the form of long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other documents.

A multimedia database stores images, audio, and video data, and is used in applications such as picture content-based retrieval, voice-mail systems, video-on-demand systems, the World Wide Web, and speech-based user interfaces.

The World Wide Web provides rich, worldwide, online information services, where data objects are linked together to facilitate interactive access. Some examples of distributed information services associated with the World Wide Web include America Online, Yahoo!, AltaVista, and Prodigy.

DATA MINING FUNCTIONALITIES / DATA MINING TASKS: WHAT KINDS OF PATTERNS CAN BE MINED?

Data mining functionalities are used to specify the kind of patterns to be found in
data mining tasks. In general, data mining tasks can be classified into two categories:

Descriptive

Predictive

Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.

(i) Concept/class description: characterization and discrimination

Data can be associated with classes or concepts. A concept/class description describes a given set of data in a concise and summarized manner, presenting interesting general properties of the data. These descriptions can be derived via:

data characterization, by summarizing the data of the class under study (often called the target class);

data discrimination, by comparison of the target class with one or a set of comparative classes; or

both data characterization and discrimination.

Data characterization

Data characterization is a summarization of the general characteristics or features of a target class of data.

Example

A data mining system should be able to produce a description summarizing the characteristics of a student who has obtained more than 75% in every semester; the result could be a general profile of such a student.

Data Discrimination

Data discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.

Example

The general features of students with high GPAs may be compared with the general features of students with low GPAs. The resulting description could be a general comparative profile of the students, such as: 75% of the students with high GPAs are fourth-year computing science students, while 65% of the students with low GPAs are not.

The output of data characterization can be presented in various forms. Examples include
pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables,
including crosstabs. The resulting descriptions can also be presented as generalized
relations, or in rule form called characteristic rules.

Discrimination descriptions expressed in rule form are referred to as discriminant rules.

(ii) Mining frequent patterns, associations and correlations

Frequent patterns and associations

Association mining is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. For example, a data mining system may find association rules like:

major(X, "computing science") ⇒ owns(X, "personal computer") [support = 12%, confidence = 98%]

where X is a variable representing a student. The rule indicates that of the students under study, 12% (support) major in computing science and own a personal computer. There is a 98% probability (confidence, or certainty) that a student in this group owns a personal computer.

Example

A grocery store retailer wants to decide whether to put bread on sale. To help determine the impact of this decision, the retailer generates association rules that show what other products are frequently purchased with bread. He finds that 60% of the time that bread is sold, pretzels are also sold, and that 70% of the time jelly is also sold. Based on these facts, he tries to capitalize on the association between bread, pretzels, and jelly by placing some pretzels and jelly at the end of the aisle where the bread is placed. In addition, he decides not to place either of these items on sale at the same time.

Correlations

Correlation analysis is a technique used to measure the association between two variables. A correlation coefficient (r) is a statistic used for measuring the strength of a supposed linear association between two variables. Correlations range from -1.0 to +1.0 in value.

A correlation coefficient of +1.0 indicates a perfect positive relationship, in which high values of one variable are related perfectly to high values of the other variable, and conversely, low values of one variable are perfectly related to low values of the other variable.

A correlation coefficient of 0.0 indicates no relationship between the two variables; that is, one cannot use the scores on one variable to tell anything about the scores on the second variable.

A correlation coefficient of -1.0 indicates a perfect negative relationship, in which high values of one variable are related perfectly to low values of the other variable, and conversely, low values of one variable are perfectly related to high values of the other variable.

Additional analysis can be performed to uncover interesting statistical correlations between associated attribute-value pairs.
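As a small hedged sketch (the two variables and their values are invented), the coefficient r can be computed with NumPy:

import numpy as np

# Two hypothetical variables measured over the same six cases.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.0])

# np.corrcoef returns the 2x2 correlation matrix; an off-diagonal entry is r.
r = np.corrcoef(x, y)[0, 1]
print(r)  # close to +1.0: a strong positive linear association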

(iii) Classification and regression for predictive analysis

Classification

Classification can be defined as the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of using the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).

Classification predicts categorical class labels. It constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data.
Typical Applications

credit approval

target marketing

medical diagnosis

treatment effectiveness analysis

A classification model can be represented in various forms, such as IF-THEN classification rules, a decision tree, a mathematical formula, or a neural network.

Example

An airport security screening station is used to determine whether passengers are potential terrorists or criminals. To do this, the face of each passenger is scanned and its basic pattern (distance between eyes, size and shape of mouth, shape of head, etc.) is identified. This pattern is compared to entries in a database to see if it matches any patterns associated with known offenders.
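As a hedged sketch of the general idea (the features, labels, and values below are invented, loosely following the credit approval application), a decision-tree classifier can be trained on labeled objects and then used to predict the class of a new object:

from sklearn.tree import DecisionTreeClassifier

# Training data: objects whose class labels are known.
# Features: [income_in_thousands, years_at_job]; label: credit approved (1) or not (0).
X_train = [[30, 1], [55, 4], [80, 10], [20, 0], [65, 7], [25, 2]]
y_train = [0, 1, 1, 0, 1, 0]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Predict the categorical class label of a new object whose label is unknown.
print(model.predict([[50, 3]]))  # e.g. [1]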

Regression

Regression is used to predict missing or unavailable numeric data values rather than class labels.

Regression analysis is a statistical methodology that is most often used for numeric prediction.
Prediction

Finding some missing or unavailable data values, rather than class labels, is referred to as prediction. Although prediction may refer to both data value prediction and class label prediction, it is usually confined to data value prediction and is thus distinct from classification. Prediction also encompasses the identification of distribution trends based on the available data.

Example

Predicting flooding is a difficult problem. One approach uses monitors placed at various points along a river. These monitors collect data relevant to flood prediction: water level, rainfall, time, humidity, and so on. The water level at a potential flooding point in the river can then be predicted based on the data collected by the sensors upriver from that point. The prediction must be made with respect to the time the data were collected.
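A minimal hedged sketch of such numeric prediction (all sensor readings and the choice of a linear model are invented for illustration):

from sklearn.linear_model import LinearRegression

# Hypothetical upriver sensor readings: [upriver_water_level_m, rainfall_mm].
X_train = [[2.1, 5.0], [2.8, 12.0], [3.5, 20.0], [4.0, 30.0]]
# Numeric target: water level later observed at the downstream flood point.
y_train = [2.4, 3.1, 3.9, 4.6]

model = LinearRegression().fit(X_train, y_train)

# Predict the (numeric) downstream water level for a new upriver reading.
print(model.predict([[3.0, 15.0]]))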

Classification vs. Prediction

Classification differs from prediction in that the former constructs a set of models (or functions) that describe and distinguish data classes or concepts, whereas the latter predicts missing or unavailable, and often numerical, data values. Their similarity is that both are tools for prediction: classification is used for predicting the class labels of data objects, and prediction is typically used for predicting missing numerical data values.

(iv) Clustering analysis

Clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. Each cluster that is formed can be viewed as a class of objects. Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.

Example

A certain national department store chain creates special catalogs targeted to various demographic groups based on attributes such as income, location, and physical characteristics of potential customers (age, height, weight, etc.). To determine the target mailings of the various catalogs and to assist in the creation of new, more specific catalogs, the company performs a clustering of potential customers based on the determined attribute values. The results of the clustering exercise are then used by management to create special catalogs and distribute them to the correct target population based on the cluster for that catalog.
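A minimal hedged sketch of such a clustering (the customer attributes and values are invented) using k-means, which groups objects without any known class labels:

from sklearn.cluster import KMeans

# Potential customers described by [income_in_thousands, age]; no class labels.
X = [[25, 22], [27, 25], [90, 45], [95, 50], [60, 33], [62, 35]]

# Group the customers into three clusters by similarity.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(labels)  # e.g. [0 0 1 1 2 2]; each cluster could receive its own catalog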

Classification vs. Clustering

In general, in classification you have a set of predefined classes and want to know which class a new object belongs to. Clustering tries to group a set of objects and to find whether there is some relationship between the objects. In the context of machine learning, classification is supervised learning and clustering is unsupervised learning.
(v) Outlier analysis

A database may contain data objects that do not comply with the general model of the data. These data objects are outliers. In other words, data objects that do not fall within any cluster are called outlier data objects. Noisy or exceptional data are also called outlier data. The analysis of outlier data is referred to as outlier mining.

Example

Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases
of extremely large amounts for a given account number in comparison to regular
charges incurred by the same account. Outlier values may also be detected with
respect to the location and type of purchase, or the purchase frequency.
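A small hedged sketch of this idea (the purchase amounts are invented): flag purchases whose z-score deviates strongly from the account's regular charges:

import numpy as np

# Hypothetical purchase amounts for one credit-card account.
amounts = np.array([35.0, 42.0, 29.0, 38.0, 41.0, 2500.0, 33.0])

# Flag purchases more than two standard deviations from the mean.
z = (amounts - amounts.mean()) / amounts.std()
print(amounts[np.abs(z) > 2])  # [2500.] flagged as a potential outlier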

ARE ALL OF THE PATTERNS INTERESTING? / WHAT MAKES A PATTERN INTERESTING?

A pattern is interesting if:

it is easily understood by humans,

it is valid on new or test data with some degree of certainty,

it is potentially useful, and

it is novel.

A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern represents knowledge.
Measures of Pattern Interestingness

There are subjective as well as objective measures of pattern interestingness.

(i) Objective Measures of Pattern Interestingness

Objective measures of pattern interestingness are based on statistics. These measures specify thresholds on statistical measures of rule interestingness, such as support and confidence.

Support Threshold

Support represents the percentage of transactions from a database that the given rule X ⇒ Y satisfies. This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of itemsets X and Y. Formally, support is defined by:

support(X ⇒ Y) = P(X ∪ Y)

Confidence Threshold

The confidence threshold assesses the degree of certainty of the discovered rule. This is taken to be the conditional probability of the rule, that is, the probability that a transaction containing X also contains Y. It is defined as follows:

confidence(X ⇒ Y) = P(Y | X)
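As a small hedged sketch, both measures can be computed directly from a list of transactions (the items below are invented for illustration):

# Hypothetical transactions: each is the set of items bought together.
transactions = [
    {"bread", "jelly", "pretzels"},
    {"bread", "jelly"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "pretzels"},
]

X, Y = {"bread"}, {"jelly"}

# support(X => Y) = P(X u Y): fraction of transactions containing X and Y.
both = sum(1 for t in transactions if (X | Y) <= t)
support = both / len(transactions)

# confidence(X => Y) = P(Y | X): of the transactions containing X,
# the fraction that also contain Y.
has_x = sum(1 for t in transactions if X <= t)
confidence = both / has_x

print(support, confidence)  # 0.4 and 0.5 for this toy data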

(ii) Subjective Measures of Pattern Interestingness

Although objective interestingness measures help to identify interesting patterns, they are ineffective unless combined with subjective measures that reflect the needs and interests of the user. Patterns that are expected can still be interesting if they confirm a hypothesis or belief that the user wished to validate.

CLASSIFICATION OF DATA MINING SYSTEMS

There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria, including the following:
(i) Classification according to mining technologies used

Data mining systems employ and provide different techniques. This classification categorizes data mining systems according to the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data warehouse-oriented approaches, etc. The classification can also take into account the degree of user interaction involved in the data mining process, such as query-driven systems, interactive exploratory systems, or autonomous systems. A comprehensive system would provide a wide variety of data mining techniques to fit different situations and options, and offer different degrees of user interaction.

(ii) Classification according to the type of data source mined

This classification categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, World Wide Web data, etc.

(iii) Classification according to the data model

This classification categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional database, etc.

(iv) Classification according to the kind of knowledge discovered

This classification categorizes data mining systems based on the kind of knowledge discovered, or data mining functionalities, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data mining functionalities together.
DATA MINING TASK PRIMITIVES

The user can specify a data mining task in the form of a data mining query. This query
is input to the system. A data mining query is defined in terms of data mining task
primitives. These primitives allow the user to communicate in an interactive manner
with the data mining system. The following are the data mining task primitives:

(i) Task-relevant data

This primitive specifies the data upon which mining is to be performed. It involves
specifying the database and tables or data warehouse containing the relevant data,
conditions for selecting the relevant data, the relevant attributes or dimensions for
exploration, and instructions regarding the ordering or grouping of the data retrieved.

(ii) Knowledge type to be mined

This primitive specifies the specific data mining function to be performed, such as characterization, discrimination, association, classification, clustering, or evolution analysis. The user can also be more specific and provide pattern templates that all discovered patterns must match. These templates, or metapatterns (also called metarules or metaqueries), can be used to guide the discovery process.

(iii) Background knowledge

This primitive allows users to specify knowledge they have about the domain to be
mined. Such knowledge can be used to guide the knowledge discovery process
and evaluate the patterns that are found. Of the several kinds of background
knowledge, this task primitive focuses on concept hierarchies.

(iv) Pattern interestingness measure

This primitive allows users to specify functions that are used to separate uninteresting
patterns from knowledge and may be used to guide the mining process, as well as to
evaluate the discovered patterns. This allows the user to confine the number of
uninteresting patterns returned by the process, as a data mining process may
generate a large number of patterns. Interestingness measures can be specified for
such pattern characteristics as simplicity, certainty, utility and novelty.
(v) Visualization of discovered patterns

This primitive refers to the form in which discovered patterns are to be displayed. In order for data mining to be effective in conveying knowledge to users, data mining systems should be able to display the discovered patterns in multiple forms, such as rules, tables, crosstabs (cross-tabulations), pie or bar charts, decision trees, cubes, or other visual representations.

INTEGRATION OF A DATA MINING SYSTEM WITH A DATABASE OR DATA WAREHOUSE SYSTEM

The differences between the following architectures for the integration of a data
mining system with a database or data warehouse system are as follows.

(i) No coupling

The data mining system uses sources such as flat files to obtain the initial data set to be
mined since no database system or data warehouse system functions are implemented as
part of the process. Thus, this architecture represents a poor design choice.

(ii) Loose coupling

The data mining system is not integrated with the database or data warehouse system
beyond their use as the source of the initial data set to be mined, and possible use in
storage of the results. Thus, this architecture can take advantage of the flexibility,
efficiency and features such as indexing that the database and data warehousing systems
may provide. However, it is difficult for loose coupling to achieve high scalability and good
performance with large data sets as many such systems are memory-based.

(iii) Semi-tight coupling

Some of the data mining primitives, such as aggregation, sorting, or precomputation of statistical functions, are efficiently implemented in the database or data warehouse system for use by the data mining system during mining-query processing. Also, some frequently used intermediate mining results can be precomputed and stored in the database or data warehouse system, thereby enhancing the performance of the data mining system.

(iv) Tight coupling

The database or data warehouse system is fully integrated as part of the data mining system and thereby provides optimized data mining query processing. Thus, the data mining subsystem is treated as one functional component of an information system. This is a highly desirable architecture, as it facilitates efficient implementations of data mining functions, high system performance, and an integrated information processing environment.

From the descriptions of the architectures provided above, it can be seen that tight coupling is the best alternative if we disregard technical and implementation issues. However, as much of the technical infrastructure needed in a tightly coupled system is still evolving, implementation of such a system is non-trivial. Therefore, the most popular architecture is currently semi-tight coupling, as it provides a compromise between loose and tight coupling.
