Report of 2nd Defence
University School of Information Technology GGS Indraprastha University Kashmere Gate, Delhi 6
(2008-2011)
Abstract
This research work proposes an improved Apriori algorithm that minimizes the number of candidate sets generated while producing association rules, by evaluating the quantitative information associated with each item that occurs in a transaction; this information is usually discarded, since traditional association rules focus only on qualitative correlations. The proposed approach reduces not only the number of item sets generated but also the overall execution time of the algorithm. Any valued attribute is treated as quantitative and used to derive quantitative association rules, which generally increases the rules' information content. Transaction reduction is achieved by discarding, in subsequent scans, transactions that do not contain any frequent item set, which in turn reduces overall execution time. Dynamic item set counting adds new candidate item sets only when all of their subsets are estimated to be frequent. The frequent item ranges form the basis for generating higher-order item ranges using the Apriori algorithm. During each iteration, the algorithm uses the frequent sets from the previous iteration to generate the candidate sets and checks whether their support is above the threshold. The set of candidates found is then pruned by a strategy that discards sets containing infrequent subsets. This work evaluates the scalability of the algorithm in terms of transaction time, the number of item sets used in the transactions, and memory utilization. Quantitative association rules can be used in several domains where the traditional approach is employed.
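A minimal Python sketch of the transaction-reduction step described above is given below (the function name and signature are hypothetical, not taken from this work). It discards transactions that contain no frequent k-itemset before the next scan; such transactions cannot contribute support to any (k+1)-itemset.

    from itertools import combinations

    def reduce_transactions(transactions, frequent_k, k):
        # Keep only transactions that still contain at least one frequent
        # k-itemset; the rest cannot support any (k+1)-itemset, so they
        # can safely be skipped in subsequent scans.
        kept = []
        for t in transactions:
            k_subsets = {frozenset(c) for c in combinations(sorted(t), k)}
            if k_subsets & frequent_k:
                kept.append(t)
        return kept

Applying this after each pass shrinks the database that later scans must traverse, which is the source of the execution-time reduction claimed above.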
Introduction
Association means the grouping of related items from a set. A simple example is analyzing a large database of supermarket transactions with the aim of finding association rules; this is called association rule mining, or market basket analysis. Association rule mining is used to find frequent patterns, associations, and correlations among the sets of items or objects in transactional and relational databases. Retailers and entrepreneurs can use these findings to plan advertisements and improve their business.

Market basket analysis discovers customers' purchasing habits. The analysis is performed on customer baskets to identify frequent combinations of products. Market Basket Analysis (MBA) is a technique that assists in understanding which items are likely to be purchased together according to the association rules, primarily with the aim of identifying cross-selling opportunities. A supermarket can use this technique to organize and place products that are frequently sold together in the same area. Direct marketers can use MBA to decide which new products to offer their customers. The application of market basket analysis is generally facilitated by the use of data mining tools. Using this analysis, marketers can identify products in demand and determine the "combined take rates" of products, defined as how often the items are bought together. In a database this can be answered with a query; however, with 100 products it would take thousands of queries to find the "most popular basket". The original work on association rules proposed the support-confidence measurement framework and reduced association rule mining to the discovery of frequent item sets. The two basic measures of an association rule are:

Support: a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
Confidence: a measure of how often the consequent is true when the antecedent is true.

Five different algorithms have been used in the development of association rules: AIS, SETM, Apriori, AprioriTID, and AprioriHybrid.

Example transaction database:

Transaction Id   Items
1                A, B, C
2                A, C
3                A, D
4                B, E, F
Let both the minimum support and the minimum confidence be 50%. We then have the following association rules:

A -> C (support 50%, confidence 66.6%)
C -> A (support 50%, confidence 100%)

For A -> C, the 66.6% means that a customer who buys A has a 66.6% chance of also buying C. For C -> A, a customer who buys C is 100% certain to also buy A. The equations for support and confidence are:

Support (P -> Q) = Probability (P ∪ Q)
Confidence (P -> Q) = Probability (Q | P)

Considering the transaction database above, for the rule A -> C:

Support ({A, C}) = 2 / 4 × 100% = 50%
Confidence = Support ({A, C}) / Support ({A}) = 50% / 75% = 66.6%
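These figures can be verified directly. Below is a minimal Python sketch (the names are illustrative, not from this work) that computes support and confidence over the four-transaction example and reproduces the 50% and 66.6% values:

    # The four-transaction example database from the text.
    transactions = [
        {"A", "B", "C"},
        {"A", "C"},
        {"A", "D"},
        {"B", "E", "F"},
    ]

    def support(itemset, db):
        # Fraction of transactions containing every item in `itemset`.
        return sum(itemset <= t for t in db) / len(db)

    def confidence(antecedent, consequent, db):
        # Support of the combined itemset divided by support of the antecedent.
        return support(antecedent | consequent, db) / support(antecedent, db)

    print(support({"A", "C"}, transactions))       # 0.5   -> 50%
    print(confidence({"A"}, {"C"}, transactions))  # 0.666 -> 66.6%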
As a larger worked example, consider a transaction database D of ten transactions over the items I1 to I5, with a minimum support threshold of 3/10. First, obtain the frequent 1-itemsets L1 = { {I1}, {I2}, {I3}, {I4}, {I5} }. Next, generate the candidate itemsets C2 = { {I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5} } and scan D to count each candidate; their supports are 6/10, 6/10, 2/10, 3/10, 5/10, 3/10, 3/10, 1/10, 2/10 and 1/10, respectively. Removing the itemsets whose support is lower than 3/10 gives L2 = { {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5} }. Then generate the candidate itemsets C3 = { {I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5} } and scan D again; the supports are 4/10, 3/10, 2/10, 1/10, 2/10 and 1/10, respectively, yielding the frequent itemsets L3 = { {I1,I2,I3}, {I1,I2,I5} }. Finally, C4 = { {I1,I2,I3,I5} } has support 2/10, which is lower than the minimum support threshold, so the process of finding frequent itemsets terminates.
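A minimal Python sketch of the candidate generation just described is given below (it uses a simplified all-pairs join rather than the prefix-based join; the function name is hypothetical). Note that if the subset-pruning strategy is applied at generation time, only {I1,I2,I3} and {I1,I2,I5} of the six candidates in C3 above survive, since each of the other four contains an infrequent 2-subset, so the scan of D can skip them:

    from itertools import combinations

    def apriori_gen(frequent_k, k):
        # Join step: union pairs of frequent k-itemsets into (k+1)-itemsets
        # (a simplified all-pairs join). Prune step: discard any candidate
        # that has a k-subset which is not frequent.
        candidates = set()
        for a in frequent_k:
            for b in frequent_k:
                union = a | b
                if len(union) == k + 1 and all(
                        frozenset(s) in frequent_k
                        for s in combinations(union, k)):
                    candidates.add(union)
        return candidates

    # L2 from the worked example above.
    L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                                 ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}
    print(sorted(map(sorted, apriori_gen(L2, 2))))
    # [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]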
Literature Survey
This section presents a comprehensive survey, focused mainly on research methods for mining frequent itemsets and association rules with utility considerations. Most of the existing works paid attention to performance and memory aspects. The AIS (Agrawal, Imielinski, Swami) algorithm put forth by Agrawal was the forerunner of all the algorithms used to generate frequent itemsets and confident association rules; its description was given along with the introduction of the mining problem itself. The algorithm comprises two phases: the first phase generates the frequent itemsets, and the second generates the confident and frequent association rules from them. Exploiting the monotonicity property of the support of itemsets and the confidence of association rules led to an enhanced algorithm, which Agrawal later renamed Apriori. Although a number of algorithms were put forth following the introduction of the Apriori algorithm, a majority of them dealt with optimizing one or more steps of Apriori while retaining the same general structure. Alongside Apriori, Agrawal proposed the AprioriTid and AprioriHybrid algorithms as well. Apriori outperforms AIS on problems of various sizes: by a factor of two for high minimum support, and by more than an order of magnitude for low levels of support. SETM (SET-oriented Mining of association rules) was consistently outperformed by AIS. AprioriTid performed as well as Apriori on smaller problems, but became about twice as slow when applied to large problems.
The support counting procedure of the Apriori algorithm has attracted voluminous research, owing to the fact that the performance of the algorithm relies mostly on this aspect. Shortly after the Apriori algorithms mentioned above, Park et al. proposed an optimization called DHP (Direct Hashing and Pruning), intended to restrict the number of candidate itemsets. Brin et al. put forth the DIC (Dynamic Itemset Counting) algorithm, which partitions the database into intervals of a fixed size so as to reduce the number of traversals through the database. Another algorithm, CARMA (Continuous Association Rule Mining Algorithm), employs a similar technique but restricts the interval size to 1. A methodology entirely different from the aforesaid ones was proposed by Savasere: the vertical database layout comes into action while storing the database in main memory, and the support of an itemset is computed by intersecting the covers of two of its subsets. The Eclat algorithm put forth by Zaki is considered the archetype of the depth-first manner of generating frequent itemsets. It was followed by diverse depth-first algorithms, among which the FP-growth algorithm by Han is the most famous and widely used. The numerous algorithms available are categorized according to the parameters they focus on, performance and memory, and are discussed briefly, with comparisons and other related works, in the following sub-sections.