Data Mining and Database Systems: Where Is the Intersection?
1 Introduction
The promise of decision support systems is to exploit enterprise data for competitive advantage. Deciding what data to collect and how to clean it raises nontrivial issues. However, even after a data warehouse has been set up, it is often difficult to analyze and assimilate the data in it. OLAP takes an important first step at the problem by allowing us to view data multidimensionally, as a giant spreadsheet with sophisticated visual tools to browse and query the data (see [3] for a survey). Data mining promises a giant leap over OLAP: instead of a power OLAP user navigating the data, mining tools will automatically discover interesting patterns. Such functionality will be very useful in enterprise databases, which are characterized by a large schema as well as a large number of rows. Data mining involves data analysis techniques that have been used by statisticians and the machine learning community for quite some time (generically referred to as data analysts in this paper). This raises the question of what, if anything, database systems research can contribute to the area of data mining. In this article, I present my biased view on this issue and argue that (1) we need to focus on generic scalability requirements (rather than on features tuned to specific algorithms) wherever possible and (2) we need to try to build data mining systems that are not just scalable, but SQL-aware.
class of sampling-based algorithms can take advantage of such an approximate analysis. This is clearly an area that is rich with past work by the AI/statistics community. To avoid cutting corners in ways that severely affect the quality of the analysis, a database systems person needs to work closely with a data analysis expert.
Recently, we built a scalable classifier [4] at Microsoft Research. Our approach exemplifies one of the several ways in which the problem of building SQL-aware systems may be approached. We started with a classical main-memory implementation of a decision tree classifier and Microsoft SQL Server. We augmented this setup with a middleware layer to enhance performance. In particular, we modified the in-memory classifier so that it invokes the middleware whenever it needs to generate counts for each active node. In our first implementation, our goal was to get the SQL backend to do all the work in generating counts. The implementation helped us identify key bottlenecks, led to implementation changes that make better use of server functionality, and revealed the need for SQL extensions. In the rest of this section, I will briefly discuss the issues related to exploiting the SQL backend as well as issues related to extensions to SQL for data mining. Where appropriate, I will draw examples from the decision-tree classifier and association rule implementations.
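The count-generation step described above can be sketched roughly as follows. This is not the middleware of [4]; it is a minimal illustration that assumes SQLite (via Python's standard `sqlite3` module) in place of Microsoft SQL Server, a hypothetical `customers` table, and hypothetical function names. The first function asks the backend for one Group By query per attribute, all sharing the same From and Where clauses; the second tallies the same counts in a single pass over the node's rows, which is the effect a single-scan evaluation would aim for. The sketch counts per class label as well as per attribute value, which is an assumption about what a decision-tree classifier needs at each node.

```python
import sqlite3
from collections import Counter


def counts_per_attribute(conn, table, attributes, class_col, node_predicate, params=()):
    """Issue one GROUP BY query per attribute; every query shares FROM and WHERE.

    Table and column names are interpolated directly, so they are assumed to be
    trusted identifiers (hypothetical schema, illustration only).
    """
    counts = {}
    for attr in attributes:
        sql = (
            f"SELECT {attr}, {class_col}, COUNT(*) FROM {table} "
            f"WHERE {node_predicate} GROUP BY {attr}, {class_col}"
        )
        for value, label, n in conn.execute(sql, params):
            counts[(attr, value, label)] = n
    return counts


def counts_single_scan(conn, table, attributes, class_col, node_predicate, params=()):
    """Compute the same counts from a single pass over the node's rows.

    The tallying happens client-side here; a server-side batch aggregation
    would do the equivalent work inside the engine.
    """
    cols = ", ".join(list(attributes) + [class_col])
    sql = f"SELECT {cols} FROM {table} WHERE {node_predicate}"
    counts = Counter()
    for row in conn.execute(sql, params):
        label = row[-1]
        for attr, value in zip(attributes, row):
            counts[(attr, value, label)] += 1
    return dict(counts)


def demo_connection():
    """Build a tiny in-memory table (hypothetical data) for experimentation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (age_band TEXT, region TEXT, income TEXT, churn TEXT)"
    )
    rows = [
        ("young", "east", "low", "yes"),
        ("young", "west", "low", "no"),
        ("old", "east", "high", "no"),
        ("old", "east", "low", "yes"),
        ("young", "east", "high", "no"),
    ]
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    # The active node is described by a predicate over the data table.
    print(counts_per_attribute(conn, "customers", ["age_band", "income"],
                               "churn", "region = ?", ("east",)))
```

Both functions return a dictionary keyed by (attribute, value, class label). The per-attribute version scans the node's rows once for each attribute, so its cost grows with the number of attributes, whereas the single-pass version touches each row once.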
is important in classification since, while growing a decision-tree classifier, for every active (non-leaf) node we must collect the count of the number of tuples for every value of every attribute of the data table. This corresponds to a set of single-block aggregation queries in which all the queries share the same From and Where clauses, i.e., differ only in their Group By and Select clauses. The ability to do multi-statement optimization of such a set of related queries and to exploit a single data scan to evaluate them greatly speeds up classification algorithms. Such batch aggregation functionality goes beyond the CUBE operator [6]. Both sampling and batch aggregation interact strongly with core SQL primitives and thus with SQL relational engine implementations. We distinguish this set of new operators from those that do not interact strongly with core SQL but present a set of useful encapsulated procedures, perhaps supported via a mining extender/cartridge/blade or simply via extended system-provided stored procedures (depending on the database vendor). The purpose of the latter set of operators is to make it easy to develop new data mining applications. However, for the primitives in this class to be useful, it is important that the operations be unbundled so that they may be shared. This issue is best illustrated through a recent SQL extension proposed for association rules [2]. That proposal introduces an extension that generates association rules directly. An alternative would have been to specify frequent itemsets instead. Association rules can be derived easily with frequent itemsets as the primitive. Furthermore, a construct for frequent itemsets may be exploited more generally for different variants of association rules. Thus, building extenders that map one-to-one to individual data mining algorithms may not be ideal.
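To illustrate why frequent itemsets are a reasonable primitive, the following sketch derives association rules from a table of frequent itemsets and their support counts. It uses the standard definition of confidence, confidence(A -> B) = support(A ∪ B) / support(A). The function name, the input layout, and the sample counts are all hypothetical and are not taken from [1] or [2].

```python
from itertools import combinations


def rules_from_frequent_itemsets(support, min_confidence):
    """Derive association rules from frequent itemsets.

    `support` maps frozenset(itemset) -> support count. It is assumed to contain
    every non-empty subset of each frequent itemset, which holds because every
    subset of a frequent itemset is itself frequent.
    Returns a list of (antecedent, consequent, confidence) tuples.
    """
    rules = []
    for itemset, count in support.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for k in range(1, len(itemset)):
            for items in combinations(sorted(itemset), k):
                antecedent = frozenset(items)
                confidence = count / support[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


if __name__ == "__main__":
    # Hypothetical support counts over a small set of transactions.
    support = {
        frozenset({"bread"}): 4,
        frozenset({"milk"}): 3,
        frozenset({"bread", "milk"}): 3,
    }
    for antecedent, consequent, conf in rules_from_frequent_itemsets(support, 0.8):
        print(sorted(antecedent), "->", sorted(consequent), conf)
```

Because rule generation needs only the support table, the same frequent-itemset output could feed other rule variants (for example, a different interestingness threshold or measure) without rerunning the expensive counting step.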
5 Conclusion
In this article, I have reviewed various facets of data mining. There is an opportunity to work closely with data analysts to develop approximations of classical data analysis that are scalable. There is a need to look for generic scalability extensions for each class of data mining algorithms, rather than for specific scalable algorithms in each class. Another important direction is to consider scalable implementations over SQL systems. Such an effort will lead not only to changes in scalable algorithms, but also to new extensions to SQL that will make it better suited to support a variety of data mining utilities. Finally, it is well understood that core data mining algorithms are not sufficient by themselves, but need to be integrated with other database tools. In particular, it is necessary to augment them with visualization support, as has been done in OLAP.
Acknowledgement
We thank Umesh Dayal, Usama Fayyad, Goetz Graefe, and Jim Gray for many fruitful discussions.
References
[1] Agrawal R. et al. Fast Discovery of Association Rules, pp. 307-328 in [5].
[2] Meo R., Psaila G., Ceri S. A New SQL-like Operator for Mining Association Rules, in Proc. of VLDB 96, pp. 122-133, Mumbai, India.
[3] Chaudhuri S., Dayal U. An Overview of Data Warehousing and OLAP Technology, in Sigmod Record, March 1997.
[4] Chaudhuri S., Fayyad U., Bernhardt J. Scalable Classifier over SQL Databases, in preparation.
[5] Fayyad U. et al. Advances in Knowledge Discovery and Data Mining, MIT Press, 1996.
[6] Gray J. et al. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, in Data Mining and Knowledge Discovery, 1(1), pp. 29-53, 1997.
[7] Han J. Towards On-Line Analytical Mining in Large Databases, to appear.
[8] Sarawagi S., Thomas S., Agrawal R. Integrating Mining with Relational Database Systems: Alternatives and Implications, in Proc. of ACM Sigmod 98, to appear.