Data Mining and Database Systems: Where is the Intersection?

Surajit Chaudhuri Microsoft Research Email: [email protected]

1 Introduction
The promise of decision support systems is to exploit enterprise data for competitive advantage. The process of deciding what data to collect and how to clean such data raises nontrivial issues. However, even after a data warehouse has been set up, it is often difficult to analyze and assimilate the data in it. OLAP takes an important first step at the problem by allowing us to view data multidimensionally as a giant spreadsheet, with sophisticated visual tools to browse and query the data (see [3] for a survey). Data mining promises a giant leap over OLAP: instead of a power OLAP user navigating the data, the mining tools will automatically discover interesting patterns. Such functionality will be very useful in enterprise databases that are characterized by a large schema as well as a large number of rows. Data mining involves data analysis techniques that have been used for quite some time by statisticians and the machine learning community (generically referred to as data analysts in this paper). This raises the question of what role, if any, database systems research may play in the area of data mining. In this article, I will try to present my biased view on this issue and argue that (1) we need to focus on generic scalability requirements (rather than on features tuned to specific algorithms) wherever possible and (2) we need to try to build data mining systems that are not just scalable, but SQL-aware.

2 Data Mining Landscape

We can categorize the ongoing work in the data mining area as follows:
- Inventing new data analysis techniques
- Scaling data analysis techniques over large data sets

2.1 Inventing New Data Analysis Techniques

In my opinion, the discovery of new data analysis techniques is to a large extent an expertise that requires insight into statistics, machine learning, and related algorithmic areas. Examples of well-known techniques include decision-tree classification and clustering (see [5] for an overview of known techniques). Innovating in this space requires establishing the statistical merit of a proposed technique and appears to have little interaction with database system issues. On a more pragmatic note, we seem to have a large number of established techniques in this space.
Copyright 1998 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering

2.2 Scaling Data Analysis Techniques

In contrast to inventing new data analysis techniques, the problem of scaling analysis techniques seems far more familiar to us. Although data analysis experts have worked for quite some time on problems where the number of dimensions (data attributes) is large, they have been much less concerned with the number of data records. In part, this is because it has been traditional in data analysis work to make assumptions about the data distributions that model a data set. Assumptions on data distribution help reduce the size of the necessary data sets. In contrast, in most databases, little can be assumed a priori about the data distribution. Thus, while most statistical and machine learning schemes assume a single-level store, over large data sets we recognize the reality of a multi-level store. This leads to two possible consequences:
- Develop efficient algorithms that take into account the fact that the data set is large.
- Restrict the scope of analysis objectives.

Scalability requirements lead to algorithms that carefully stage computation when data do not fit in memory. This is an area we have been leveraging. Nonetheless, we have to guard against the following dangers when we consider scalable implementations:
- Restricting choice of data mining tasks: While there has been an impressive amount of work related to association rules (see [1] for an overview) and their generalizations, relatively little work has been done in the context of other classical data analysis techniques, e.g., clustering and classification.
- Scaling specific algorithms: There are many variants of classification and clustering algorithms. The specific choice of an algorithm will depend on the application. Therefore, instead of focusing on how to scale a specific algorithm, we should try to identify common data-centric steps in a broad class of algorithms. For example, all decision tree classifiers are driven by the data-centric operations of building counts for distinct values of attributes and then partitioning the data set. Therefore, any decision tree classifier can be minimally modified so that whenever counting or partitioning steps are needed, the classifier uses an interface to invoke a generic middleware that provides scalable implementations of those operations, optimized by leveraging the way most decision tree classifiers grow a tree [4]. As a consequence, we are able to exploit the middleware for the entire set of decision tree classifiers, instead of having to build a specific scalable implementation for each of the many variants.
- Ignoring sampling as a scaling methodology: Another generic way to scale analysis over large data sets is to use sampling. Whether sampling is appropriate for a class of data analysis techniques, and even when appropriate, how much to sample and how, will become increasingly important questions for data analysts. Use of sampling directly brings us to the related issue of restricting the scope of analysis objectives.

In designing scalable implementations, some of the algorithms have made assumptions that ignore the fact that a data warehouse will service not just data mining, but also traditional query processing. For example, some of the algorithms discuss how the physical design of the database may be tuned for a specific data mining task. However, in many cases, the physical design of a data warehouse is unlikely to be guided solely by the requirements of a single data analysis algorithm. Many of the scalable implementations also do not consider a SQL database as the repository of data. In the next section, I will discuss this issue in somewhat more detail.

The problem of restricting the scope of analysis objectives is motivated by the desire to strike a balance between the accuracy and exhaustiveness of the analysis and the desire to be efficient. While this is not a novel concept by any means, of direct interest to us will be techniques to efficiently cut corners that are motivated specifically by the large database (records and schema) scenarios. Restricting the scope of the analysis can take different forms. First, the analysis can be less than exhaustive. For example, support and confidence parameters are used to restrict the set of association rules that are mined. Next, the guarantee of the analysis can be probabilistic. A wide class of sampling-based algorithms can take advantage of such an approximate analysis. This is clearly an area that is rich with past work by the AI/statistics community. To ensure that we avoid pitfalls in cutting corners that severely affect the quality of analysis, a database systems person needs to work carefully with a data analysis expert.
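As an illustration of a less-than-exhaustive analysis, the following minimal Python sketch shows how support and confidence thresholds prune the set of association rules. It is not taken from the paper or from [1]: the function names `support` and `mine_rules`, the toy transactions, and the restriction to rules with a single item on each side are all assumptions made to keep the example short.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def mine_rules(transactions, min_support, min_confidence):
    """Return rules (lhs, rhs, support, confidence) with one item per side.

    Hypothetical helper: real miners consider larger itemsets and prune
    candidates far more cleverly than this brute-force loop.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # pruned: the pair is not frequent enough
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Toy data (assumed for illustration only).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
RULES = mine_rules(TRANSACTIONS, min_support=0.5, min_confidence=0.6)
```

Raising either threshold shrinks the rule set, which is exactly the trade between exhaustiveness and efficiency discussed above.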

3 Motivation for SQL-aware Data Mining Systems

I will first discuss two obvious reasons why we need to consider implementations of data mining algorithms that are SQL-aware. However, from my point of view, ad-hoc mining provides the most compelling reason to consider SQL-aware data mining systems.

Data is in the warehouse: Data warehouses are deploying relational database technology for storing and maintaining data. Furthermore, data in a data warehouse will not be used exclusively for data mining, but will also be shared by OLAP and other database utilities. Therefore, for pragmatic reasons, the data mining utilities must assume a relational backend.

SQL systems can be leveraged: Apart from the reality that SQL databases hold enterprise data, it is also true that SQL database management systems provide a rich set of primitives for data retrieval that the mining algorithms can exploit instead of developing all required functionality from scratch. It is surprising that although scalability of mining algorithms has been an active area of work, few significant pieces of work have looked at the issue of data mining algorithms for SQL systems. A nice study of a SQL-aware scalable implementation of association rules appears in [8].

Ad-hoc mining: Today's data mining algorithms are invoked on a materialized disk-resident data set. If data mining is to succeed, it must evolve to ad-hoc data mining, where the data set to be mined is specified on-the-fly. In other words, mining may be invoked on a data set that has been created on-the-fly by the powerful query tools. This allows us to mine an arbitrary query, not necessarily just base data. Thus, it is possible for a power user to use OLAP tools to specify a subset of the data and then to invoke mining tools on that data (cf. [7]). Likewise, it is possible to exploit the data reduction capabilities of the data mining tools to identify a subset of data that is interesting, and then use the OLAP tools to explore that subset using query features. As an example, a mining tool may be used to reduce the dimensionality of the data. For ad-hoc mining, requiring the query to be materialized will result in unacceptable performance in many cases. A far more sensible approach is to cleverly exploit the interaction of the mining operator with the SQL operators. Similar interactions are possible with data visualization tools.
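To make the idea of mining an arbitrary query concrete, here is a small sketch in Python using the standard `sqlite3` module. It is only an illustration under stated assumptions: the helper `counts_over_query`, the `customers` table, and its columns are hypothetical, and SQLite stands in for whatever SQL backend is actually used. The counting step of a classifier is pushed into SQL and runs over a query embedded as a derived table, so this code never creates a temporary table; whether the engine materializes the subquery internally is left to its optimizer.

```python
import sqlite3


def counts_over_query(conn, source_query, attribute, label, params=()):
    """Class counts per attribute value, computed over an arbitrary query.

    `attribute` and `label` are interpolated as identifiers, so they are
    assumed to come from trusted code, not from user input.
    """
    sql = (
        f'SELECT "{attribute}", "{label}", COUNT(*) '
        f"FROM ({source_query}) AS src "
        f'GROUP BY "{attribute}", "{label}"'
    )
    return {(value, cls): n for value, cls, n in conn.execute(sql, params)}


# Toy data (assumed for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (region TEXT, plan TEXT, churned TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("west", "basic", "yes"),
        ("west", "basic", "no"),
        ("west", "premium", "no"),
        ("east", "basic", "yes"),
    ],
)

# The data set to be mined is specified on the fly by a query.
ADHOC_COUNTS = counts_over_query(
    conn,
    "SELECT plan, churned FROM customers WHERE region = ?",
    "plan",
    "churned",
    ("west",),
)
```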

4 Building SQL-aware Data Mining Systems

In this section, I will use decision-tree classification as an example of a data analysis technique to illustrate the issues related to integration. The decision-tree classification process begins with the root node of the tree representing the entire data set. For each value of each attribute of the data set, the counts of tuples are computed. These counts are used to determine a criterion to either partition the data set into a set of disjoint partitions based on the values of a specific attribute or to conclude that the node (the root in this case) is a leaf node in the decision tree. This count-and-split cycle is repeated until no new partitions are possible.
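The count-and-split cycle described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the classifier of [4]: the function names, the toy rows, and in particular the splitting criterion (lowest weighted entropy) are assumptions, since the text does not commit to a specific criterion.

```python
from collections import Counter, defaultdict
from math import log2


def build_counts(rows, attributes, label):
    """counts[attr][value] is a Counter of class labels for that value."""
    counts = {a: defaultdict(Counter) for a in attributes}
    for row in rows:
        for a in attributes:
            counts[a][row[a]][row[label]] += 1
    return counts


def entropy(class_counts):
    total = sum(class_counts.values())
    return -sum(c / total * log2(c / total) for c in class_counts.values() if c)


def choose_split(counts, n_rows):
    """Pick the attribute with the lowest weighted entropy (assumed criterion)."""
    def weighted(attr):
        return sum(
            sum(cc.values()) / n_rows * entropy(cc) for cc in counts[attr].values()
        )
    return min(counts, key=weighted)


def grow(rows, attributes, label):
    """Return a nested dict {attribute: {value: subtree}} or a class label (leaf)."""
    labels = Counter(r[label] for r in rows)
    if len(labels) == 1 or not attributes:
        return labels.most_common(1)[0][0]  # leaf: no further partition
    counts = build_counts(rows, attributes, label)   # count step
    best = choose_split(counts, len(rows))
    remaining = [a for a in attributes if a != best]
    partitions = defaultdict(list)                   # split step
    for r in rows:
        partitions[r[best]].append(r)
    return {best: {v: grow(p, remaining, label) for v, p in partitions.items()}}


# Toy data (assumed for illustration only).
ROWS = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
TREE = grow(ROWS, ["outlook", "windy"], "play")
```

Note that only `build_counts` and the partitioning loop touch the data; these are the data-centric steps that a scalable implementation would delegate to the database.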

Recently, we built a scalable classifier [4] at Microsoft Research. Our approach exemplifies one of the several ways in which the problem of building SQL-aware systems may be approached. We started with a classical main-memory implementation of a decision tree classifier and Microsoft SQL Server. We augmented this setup with a middleware to enhance performance. In particular, we modified the in-memory classifier so that it invokes the middleware whenever it needs to generate counts for each active node. In our first implementation, our goal was to get the SQL backend to do all the work in generating counts. The implementation helped us identify key bottlenecks, led to implementation changes for optimal use of server functionality, and revealed needs for SQL extensions. In the rest of this section, I will briefly discuss the issues related to exploiting the SQL backend as well as issues related to extensions to SQL for data mining. Where appropriate, I will draw examples from the decision-tree classifier and association rule implementations.

4.1 Using SQL Backend

Effectively using a SQL backend for data mining applications is a nontrivial problem, since using the SQL backend as much as possible in an obvious way may hurt performance. The problem is analogous to what ROLAP providers faced in building their middleware over SQL engines. In particular, instead of generating a single complex SQL statement against the backend, they often generate multi-statement SQL that may be executed more efficiently. Similar considerations will be needed in the context of data mining. On the other hand, we need to exploit the functionality in the SQL subsystem that can indeed be leveraged. Physical database design (see footnote 1) and the query processing subsystem, including its use of parallelism, are examples of functionality that data mining applications can exploit. Although the above seems too obvious to mention, there are few implementations of data mining algorithms today that take advantage of this functionality. The goal of harnessing the above functionality often leads to novel ways of staging computation. In [4], we discuss how we can batch the servicing of multiple active nodes (i.e., nodes that are still being grown) of a scalable decision tree classification algorithm and exploit data structures in the database server.

4.2 SQL Extensions

As we implement mining algorithms that generate SQL efficiently, we will also identify primitives that need to be incorporated in SQL. Once again, we can draw similarities with the OLAP world. Generation of SQL queries against the backend clearly benefits from the CUBE construct [6]. We can identify two goals for studying possible extensions to SQL, namely extensions that:
1. strongly interact with core SQL primitives and can result in significant performance improvement.
2. encapsulate a set of useful data mining primitives.

We feel that extensions that belong to (1) are extremely useful. An example of an operator that belongs there is the ability to sample a relation and, more generally, a query. This functionality can be exploited by many data mining algorithms, especially algorithms that provide probabilistic guarantees. However, the operation to sample a query is also a feature that strongly interacts with the query system. In particular, a sampling operator can be pushed down past a selection and interacts with other relational operators. While there has been substantial past research in this area, implementation issues related to processing sampling along with other relational operators continue to be an active area.

In our recent work on building scalable classification algorithms over SQL subsystems [4], we recognized that there is a strong performance incentive to do batch aggregation. Intuitively, batch aggregation helps fully leverage a single data scan by evaluating multiple aggregations over the same data (or query). This functionality is important in classification because, while growing a decision-tree classifier, for every active (non-leaf) node we must collect the count of the number of tuples for every value of every attribute of the data table. This corresponds to a set of single-block aggregation queries where all the queries share the same From and Where clauses, i.e., differ only in their Group By and Select clauses. Having the ability to do multi-statement optimization of the above set of related queries and to exploit a single data scan to evaluate them greatly speeds up the classification algorithms. Such batch aggregation functionality goes beyond the CUBE operator [6]. Both sampling and batch aggregation strongly interact with core SQL primitives and thus with the SQL relational engine implementations.

We distinguish the above set of new operators from those that do not strongly interact with core SQL but present a set of useful encapsulated procedures, perhaps supported via a mining extender/cartridge/blade or simply via extended system-provided stored procedures (depending on the database vendor). The purpose of such a set of operators is to make it easy to develop new data mining applications. However, for the set of primitives in this class to be useful, it is important that the operations be unbundled so that they may be shared. This issue is best illustrated through a recent SQL extension that has been proposed for association rules [2]. In that proposal, an extension is proposed to generate association rules. However, note that an alternative would have been to consider specifying frequent itemsets instead. An association rule can be derived easily with frequent itemsets as the primitive. Furthermore, the construct for frequent itemsets may be exploited more generally for different variants of association rules. Thus, building extenders that map one-to-one to individual data mining algorithms may not be ideal.

Footnote 1: It is important to emphasize that data mining algorithms need to exploit the physical design, but should not assume that such algorithms will singularly dictate such designs.
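The batch aggregation idea discussed in this section can be illustrated with a small Python sketch over the standard `sqlite3` module. It is an emulation under stated assumptions, not the server-side functionality argued for in the text: the table `training`, its columns, and the helpers `counts_per_query` and `counts_single_scan` are hypothetical. The first helper issues one Group By query per attribute, all sharing the same From and Where clauses; the second obtains the same counts from a single scan, here performed in client code because SQLite offers no batch aggregation operator.

```python
import sqlite3
from collections import Counter


def counts_per_query(conn, table, attributes, label, where, params=()):
    """One GROUP BY query per attribute: the data is read once per query."""
    result = {}
    for a in attributes:
        sql = (
            f'SELECT "{a}", "{label}", COUNT(*) FROM "{table}" '
            f'WHERE {where} GROUP BY "{a}", "{label}"'
        )
        result[a] = Counter({(v, c): n for v, c, n in conn.execute(sql, params)})
    return result


def counts_single_scan(conn, table, attributes, label, where, params=()):
    """One scan: fetch the needed columns once and update every counter."""
    cols = ", ".join(f'"{a}"' for a in attributes)
    sql = f'SELECT {cols}, "{label}" FROM "{table}" WHERE {where}'
    result = {a: Counter() for a in attributes}
    for row in conn.execute(sql, params):
        *values, cls = row
        for a, v in zip(attributes, values):
            result[a][(v, cls)] += 1
    return result


# Toy data (assumed for illustration only); node_id marks the active node.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE training (node_id INTEGER, color TEXT, size TEXT, label TEXT)"
)
conn.executemany(
    "INSERT INTO training VALUES (?, ?, ?, ?)",
    [
        (1, "red", "small", "yes"),
        (1, "red", "large", "yes"),
        (1, "blue", "small", "no"),
        (2, "blue", "large", "no"),
    ],
)

ATTRS = ["color", "size"]
SEPARATE = counts_per_query(conn, "training", ATTRS, "label", "node_id = ?", (1,))
BATCH = counts_single_scan(conn, "training", ATTRS, "label", "node_id = ?", (1,))
```

Both helpers return identical counts; the difference is that the first reads the qualifying rows once per attribute, while the second reads them once in total, at the cost of shipping the rows to the client. Doing the same work inside the engine would avoid that transfer, which is the motivation for a SQL extension.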

5 Conclusion
In this article, I have reviewed various facets of data mining. There is an opportunity to work closely with data analysts to develop approximations of classical data analysis that are scalable. There is a need to look for generic scalability extensions for each class of data mining algorithms, rather than for specific scalable algorithms in each class. Another important direction is to consider scalable implementations over SQL systems. Such an effort will lead not only to changes in scalable algorithms, but also to new extensions for SQL that will make it better suited to support a variety of data mining utilities. Finally, it is well understood that core data mining algorithms by themselves are not sufficient, but need to be integrated with other database tools. In particular, it is necessary to augment them with visualization support, as has been done in OLAP.

Acknowledgement
We thank Umesh Dayal, Usama Fayyad, Goetz Graefe, and Jim Gray for many fruitful discussions.

References
[1] Agrawal R. et al. Fast Discovery of Association Rules, pp. 307-328 in [5].
[2] Meo R., Psaila G., Ceri S. A New SQL-like Operator for Mining Association Rules, in Proc. of VLDB96, pp. 122-133, Mumbai, India.
[3] Chaudhuri S., Dayal U. An Overview of Datawarehousing and OLAP Technology, in Sigmod Record, March 1997.
[4] Chaudhuri S., Fayyad U., Bernhardt J. Scalable Classifier over SQL Databases, in preparation.
[5] Fayyad U. et al. Advances in Knowledge Discovery and Data Mining, MIT Press, 1996.
[6] Gray et al. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub Totals, in Data Mining and Knowledge Discovery, 1(1), pp. 29-53, 1997.
[7] Han J. Towards On-Line Analytical Mining in Large Databases, to appear.
[8] Sarawagi S., Thomas S., Agrawal R. Integrating Mining with Relational Database Systems: Alternatives and Implications, in Proc. of ACM Sigmod 98, to appear.
