Data Warehousing and Data Mining - Reference Note

Unit-5: Mining Frequent Patterns

Frequent Patterns
Frequent patterns are patterns (itemsets, subsequences, or substructures) that appear in a data set frequently.
- For example, a set of items such as milk and bread that appear together frequently in a transaction data set is a frequent itemset.
- A subsequence, such as buying first a PC, then a digital camera, and then a memory card, is a (frequent) sequential pattern if it occurs frequently in a shopping history database.
- A substructure can refer to different structural forms, such as subgraphs or subtrees, which may be combined with itemsets or subsequences. If a substructure occurs frequently in a graph database, it is called a (frequent) structural pattern.
Finding frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data. It also helps in data indexing, classification, clustering, and other data mining tasks.

Market Basket Analysis
Market basket analysis analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together. Such information can lead to increased sales by helping retailers do selective marketing and design better store layouts.
(Figure: a market analyst studies customers' shopping baskets, e.g. {milk, bread, sugar}, to find associations among the items.)
Items that are frequently purchased together can be placed in proximity to further encourage their combined sale. Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase milk and bread together, then a sale on milk may encourage the sale of bread as well.

Frequent Itemsets
- Itemset: An itemset is a set of items, e.g. X = {milk, bread, coke}.

    TID | Items
    1   | Bread, Milk
    2   | Bread, Diaper, Beer, Eggs
    3   | Milk, Diaper, Beer, Coke
    4   | Bread, Milk, Diaper, Beer
    5   | Bread, Milk, Diaper, Coke

- An itemset that contains k items is a k-itemset, e.g. {milk, bread} is a 2-itemset.
- Support count (σ): the frequency of occurrence of an itemset, e.g. σ({bread, milk}) = 3 in the table above.
- Frequent itemset: items that frequently appear together, e.g. {bread, milk}. An itemset whose support is greater than or equal to a minimum support threshold (min_sup) is called a frequent itemset.

Closed Itemsets
An itemset X is closed in a data set S if there exists no proper super-itemset Y such that Y has the same support count as X in S. For example:

    Itemset | Support        Itemset      | Support
    {A}     | 4              {A, B, C}    | 2
    {B}     | 5              {A, B, D}    | 3
    {C}     | 3              {A, C, D}    | 2
    {D}     | 4              {B, C, D}    | 3
    {A, B}  | 4              {A, B, C, D} | 2
    {A, C}  | 2
    {A, D}  | 3
    {B, C}  | 3
    {B, D}  | 4
    {C, D}  | 3

Here, {B} is a closed itemset because all of its supersets have a smaller support count than {B}. Itemsets such as {A}, {C}, and {D}, on the other hand, are not closed, since each of them has a proper superset with the same support count. The itemset {A, B} is also closed, because its supersets {A, B, C}, {A, B, D}, and {A, B, C, D} all have smaller support counts than {A, B}.
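The two definitions above are easy to verify programmatically. The following is a minimal Python sketch (illustrative only; the function names are not from the note) that counts the support of every itemset occurring in the five example transactions and lists which itemsets are closed.

```python
from itertools import combinations
from collections import Counter

# The five example transactions from the Frequent Itemsets table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_counts(transactions):
    """Support count of every non-empty itemset that occurs in the data."""
    counts = Counter()
    for t in transactions:
        for k in range(1, len(t) + 1):
            for itemset in combinations(sorted(t), k):
                counts[frozenset(itemset)] += 1
    return counts

def closed_itemsets(counts):
    """An itemset is closed if no proper superset has the same support count."""
    return [(s, sup) for s, sup in counts.items()
            if not any(s < other and counts[other] == sup for other in counts)]

counts = support_counts(transactions)
print(counts[frozenset({"Bread", "Milk"})])   # support count of {Bread, Milk} -> 3
for itemset, sup in closed_itemsets(counts):
    print(set(itemset), sup)
```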
Association Rules
Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository. A rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item found in the data; a consequent is an item that is found in combination with the antecedent.

Let I = {I1, I2, ..., Im} be a set of items and D = {T1, T2, ..., Tn} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form

    X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅.

An example rule for a supermarket could be {butter, bread} ⇒ {milk}, meaning that if a customer purchases butter and bread, then he/she is also likely to buy milk.

Support and Confidence
To measure the interestingness of association rules, two measures are used:
- Support: The support of an association rule X ⇒ Y is the percentage of transactions in the data set that contain both X and Y:

      Support(X ⇒ Y) = P(X ∪ Y) = σ(X ∪ Y) / (no. of transactions)

  For example, in the transaction table above, Support(bread ⇒ milk) = 3/5 = 60%, since both items occur together in 60% of all transactions.
- Confidence: The confidence of an association rule X ⇒ Y is the percentage of transactions containing X that also contain Y:

      Confidence(X ⇒ Y) = P(Y | X) = P(X ∪ Y) / P(X) = σ(X ∪ Y) / σ(X)

  For example, Confidence(bread ⇒ milk) = 3/4 = 75%, since 75% of all transactions containing bread also contain milk.
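As a concrete check of the two formulas, here is a small Python sketch (function names are illustrative, not from the note) that evaluates support and confidence for the rule bread ⇒ milk on the same five transactions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    both = sum((antecedent | consequent) <= t for t in transactions)
    with_antecedent = sum(antecedent <= t for t in transactions)
    return both / with_antecedent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
print(support(transactions, {"Bread", "Milk"}))        # 0.6  (60%)
print(confidence(transactions, {"Bread"}, {"Milk"}))   # 0.75 (75%)
```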
Association Rule Mining
Association rule mining is a method for discovering frequent patterns in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. Association rule mining consists of two sub-processes: finding frequent itemsets and generating association rules from those itemsets.
- Frequent itemset: a set of items whose support is greater than the user-specified minimum support.
- Association rule: an association rule must satisfy a user-set minimum support (min_sup) and minimum confidence (min_conf). The rule X ⇒ Y is called a strong association rule if its support ≥ min_sup and its confidence ≥ min_conf.

Finding Association Rules
(Figure 1: Generating association rules - first find all frequent itemsets using the minimum support, then find association rules from those frequent itemsets using the minimum confidence.)

Types of Association Rules
1. Single-dimensional association rule: contains a single predicate (e.g. purchase) with multiple occurrences (i.e. the predicate occurs more than once within the rule). For example:
       purchase(X, "milk") ⇒ purchase(X, "bread")
2. Multidimensional association rule: contains two or more predicates, each occurring only once. For example:
       age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "laptop")
3. Hybrid-dimensional association rule: a multidimensional association rule with repeated predicates, i.e. some predicate occurs more than once. For example:
       age(X, "19-25") ∧ buys(X, "laptop") ⇒ buys(X, "printer")
4. Multilevel association rule: association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules. They can be mined efficiently using concept hierarchies under a support-confidence framework. There are two approaches:
   - Uniform minimum support for all levels (uniform support): the same minimum support threshold is used when mining at each level of abstraction.
   - Reduced minimum support at lower levels (reduced support): each level of abstraction has its own minimum support threshold; the deeper the level of abstraction, the smaller the corresponding threshold.
   (Figure: Level 1 "computer" [support = 10%]; Level 2 "laptop computer" [support = 6%] and "desktop computer" [support = 4%]. With uniform support, min_sup = 5% at both levels; with reduced support, min_sup = 5% at level 1 and 3% at level 2.)
5. Quantitative association rule: database attributes can be categorical or quantitative. Categorical attributes have a finite number of possible values with no ordering among them (e.g. occupation, brand, color). Quantitative attributes are numeric and have an implicit ordering among values (e.g. age, income, price). Association rules mined from quantitative attributes are referred to as quantitative association rules. Unlike general association rules, where both the left-hand and right-hand sides are categorical attributes, at least one side of a quantitative association rule must involve a numerical attribute. For example:
       age(X, "20-30") ∧ salary(X, "50-60K") ⇒ buys(X, "laptop")

Apriori Algorithm
The Apriori algorithm is a sequence of steps for finding the frequent itemsets in a given database.
- Apriori property: all subsets of a frequent itemset must be frequent.
- Apriori pruning principle: if an itemset is infrequent, its supersets should not be generated or tested.
Apriori employs an iterative approach known as a level-wise search, where frequent k-itemsets are used to explore frequent (k+1)-itemsets. First, the set of frequent 1-itemsets that satisfy minimum support is found by scanning the database; the resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database. To improve the efficiency of this level-wise generation, the Apriori property is used for pruning.

Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k
    L1 = {frequent 1-itemsets};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;

Example (min_sup = 2). Database D:

    TID  | Items
    T100 | 1, 3, 4
    T200 | 2, 3, 5
    T300 | 1, 2, 3, 5
    T400 | 2, 5

Scanning D level by level gives:

    C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3          L1: {1}:2, {2}:3, {3}:3, {5}:3
    C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2,        L2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
        {2,5}:3, {3,5}:2
    C3: {2,3,5}:2                                  L3: {2,3,5}:2

Note: {1,2,3}, {1,2,5} and {1,3,5} are not in C3, because each of them contains a 2-item subset that is not frequent (Apriori pruning).

Generating Association Rules From Frequent Itemsets
For each frequent itemset f, generate all non-empty proper subsets of f. For every such subset s,

    output the rule s ⇒ (f − s) if σ(f) / σ(s) ≥ min_conf.

From the example above, the frequent itemset is {2, 3, 5} with σ = 2. Taking min_conf = 75%:

    Rule        | Confidence
    2 ∧ 3 ⇒ 5   | 2/2 = 100%
    2 ∧ 5 ⇒ 3   | 2/3 ≈ 67%
    3 ∧ 5 ⇒ 2   | 2/2 = 100%
    2 ⇒ 3 ∧ 5   | 2/3 ≈ 67%
    3 ⇒ 2 ∧ 5   | 2/3 ≈ 67%
    5 ⇒ 2 ∧ 3   | 2/3 ≈ 67%

Since the given minimum confidence threshold is 75%, the rules 2 ∧ 3 ⇒ 5 and 3 ∧ 5 ⇒ 2 are the strong association rules for this problem.
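The pseudo-code and the rule-generation step translate almost directly into Python. The sketch below (function and variable names are illustrative, not from the note) implements the level-wise search with Apriori pruning and then generates rules; run on the four-transaction example, it reproduces L3 = {2, 3, 5} and the two strong rules.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise search: frequent k-itemsets Lk are used to build the (k+1)-candidates."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    freq = {}                                  # every frequent itemset with its support count
    Lk = {c: s for c, s in count({frozenset([i]) for i in items}).items() if s >= min_sup}
    while Lk:
        freq.update(Lk)
        prev, k = list(Lk), len(next(iter(Lk))) + 1
        # Join step: merge frequent k-itemsets into (k+1)-candidates, then prune every
        # candidate that has an infrequent k-subset (Apriori pruning principle).
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = {c: s for c, s in count(candidates).items() if s >= min_sup}
    return freq

def generate_rules(freq, min_conf):
    """For each frequent itemset f, output s => (f - s) when sigma(f)/sigma(s) >= min_conf."""
    rules = []
    for f, sup_f in freq.items():
        for r in range(1, len(f)):
            for s in map(frozenset, combinations(f, r)):
                if sup_f / freq[s] >= min_conf:
                    rules.append((s, f - s, sup_f / freq[s]))
    return rules

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]        # the example database above
freq = apriori(D, min_sup=2)
print([set(f) for f in freq if len(f) == 3])             # [{2, 3, 5}]
for s, c, conf in generate_rules(freq, min_conf=0.75):
    if len(s | c) == 3:                                  # only the rules derived from {2, 3, 5}
        print(sorted(s), "=>", sorted(c), f"conf={conf:.0%}")
```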
Example 1
Apply the Apriori algorithm to the following order data (min_sup = 2, min_conf = 70%) and find the strong association rules.

    Order   | Meal items            Order   | Meal items
    Order 1 | M1, M2, M5            Order 6 | M2, M3
    Order 2 | M2, M4                Order 7 | M1, M3
    Order 3 | M2, M3                Order 8 | M1, M2, M3, M5
    Order 4 | M1, M2, M4            Order 9 | M1, M2, M3
    Order 5 | M1, M3

Solution:

Step 1: Calculating C1 and L1

    Candidate 1-itemsets (C1)        Frequent 1-itemsets (L1)
    Itemset | Sup_count              Itemset | Sup_count
    {M1}    | 6                      {M1}    | 6
    {M2}    | 7                      {M2}    | 7
    {M3}    | 6                      {M3}    | 6
    {M4}    | 2                      {M4}    | 2
    {M5}    | 2                      {M5}    | 2

Step 2: Calculating C2 and L2

    Candidate 2-itemsets (C2)        Frequent 2-itemsets (L2)
    Itemset   | Sup_count            Itemset   | Sup_count
    {M1, M2}  | 4                    {M1, M2}  | 4
    {M1, M3}  | 4                    {M1, M3}  | 4
    {M1, M4}  | 1                    {M1, M5}  | 2
    {M1, M5}  | 2                    {M2, M3}  | 4
    {M2, M3}  | 4                    {M2, M4}  | 2
    {M2, M4}  | 2                    {M2, M5}  | 2
    {M2, M5}  | 2
    {M3, M4}  | 0
    {M3, M5}  | 1
    {M4, M5}  | 0

Step 3: Calculating C3 and L3

    Candidate 3-itemsets (C3)        Frequent 3-itemsets (L3)
    Itemset       | Sup_count        Itemset       | Sup_count
    {M1, M2, M3}  | 2                {M1, M2, M3}  | 2
    {M1, M2, M5}  | 2                {M1, M2, M5}  | 2

Step 4: Calculating C4 and L4
The candidate 4-itemset is {M1, M2, M3, M5}. It is pruned because its subset {M2, M3, M5} is not frequent. Thus C4 = ∅, and the algorithm terminates.

Therefore, the frequent itemsets are {M1, M2, M3} and {M1, M2, M5}.

Now, generating association rules from the frequent itemsets:

    Taking {M1, M2, M3}:                 Taking {M1, M2, M5}:
    Rule           | Confidence          Rule           | Confidence
    M1 ⇒ M2 ∧ M3   | 2/6 ≈ 0.33          M1 ⇒ M2 ∧ M5   | 2/6 ≈ 0.33
    M2 ⇒ M1 ∧ M3   | 2/7 ≈ 0.29          M2 ⇒ M1 ∧ M5   | 2/7 ≈ 0.29
    M3 ⇒ M1 ∧ M2   | 2/6 ≈ 0.33          M5 ⇒ M1 ∧ M2   | 2/2 = 1.00
    M1 ∧ M2 ⇒ M3   | 2/4 = 0.50          M1 ∧ M2 ⇒ M5   | 2/4 = 0.50
    M1 ∧ M3 ⇒ M2   | 2/4 = 0.50          M1 ∧ M5 ⇒ M2   | 2/2 = 1.00
    M2 ∧ M3 ⇒ M1   | 2/4 = 0.50          M2 ∧ M5 ⇒ M1   | 2/2 = 1.00

As the given minimum confidence threshold is 0.7, the rules M5 ⇒ M1 ∧ M2, M1 ∧ M5 ⇒ M2 and M2 ∧ M5 ⇒ M1 can be considered the strong association rules for this problem.

Example 2
Apply the Apriori algorithm to the following data with min_sup = 60% and min_conf = 80%.

    TID  | Items_bought
    T100 | {M, O, N, K, E, Y}
    T200 | {D, O, N, K, E, Y}
    T300 | {M, A, K, E}
    T400 | {M, U, C, K, Y}
    T500 | {C, O, O, K, I, E}

Solution:
Given min_sup = 60%, i.e. 60% of 5 transactions = 3, and min_conf = 80%.

Step 1: Calculating C1 and L1

    Candidate 1-itemsets (C1)        Frequent 1-itemsets (L1)
    Itemset | Sup_count              Itemset | Sup_count
    {M}     | 3                      {M}     | 3
    {O}     | 3                      {O}     | 3
    {N}     | 2                      {K}     | 5
    {K}     | 5                      {E}     | 4
    {E}     | 4                      {Y}     | 3
    {Y}     | 3
    {D}     | 1
    {A}     | 1
    {U}     | 1
    {C}     | 2
    {I}     | 1

Step 2: Calculating C2 and L2

    Candidate 2-itemsets (C2)        Frequent 2-itemsets (L2)
    Itemset | Sup_count              Itemset | Sup_count
    {M, O}  | 1                      {M, K}  | 3
    {M, K}  | 3                      {O, K}  | 3
    {M, E}  | 2                      {O, E}  | 3
    {M, Y}  | 2                      {K, E}  | 4
    {O, K}  | 3                      {K, Y}  | 3
    {O, E}  | 3
    {O, Y}  | 2
    {K, E}  | 4
    {K, Y}  | 3
    {E, Y}  | 2

Step 3: Calculating C3 and L3

    Candidate 3-itemsets (C3)        Frequent 3-itemsets (L3)
    Itemset   | Sup_count            Itemset   | Sup_count
    {O, K, E} | 3                    {O, K, E} | 3

We stop here, since no further combinations can be made from L3.
Therefore, the frequent itemset is {O, K, E}.

Now, generating association rules from the frequent itemset:

    Rule          | Confidence
    O ⇒ K ∧ E     | 3/3 = 100%
    K ⇒ O ∧ E     | 3/5 = 60%
    E ⇒ O ∧ K     | 3/4 = 75%
    O ∧ K ⇒ E     | 3/3 = 100%
    O ∧ E ⇒ K     | 3/3 = 100%
    K ∧ E ⇒ O     | 3/4 = 75%

As the given minimum confidence threshold is 80%, the rules O ⇒ K ∧ E, O ∧ K ⇒ E and O ∧ E ⇒ K can be considered the strong association rules for this problem.

Advantages of the Apriori algorithm
- It is an easy-to-understand algorithm.
- It has low memory consumption.
- It is easy to implement.
- It uses the Apriori property for pruning, so the itemsets left for further support checking remain few.

Limitations of the Apriori algorithm
- Apriori needs to generate candidate itemsets; these may be very large in number if the itemsets in the database are numerous.
- Apriori needs multiple scans of the database to check the support of each candidate itemset generated, which leads to high cost.
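As a quick check of Example 2, the apriori and generate_rules sketch from the previous section can be run on the five transactions; under those assumptions it reports the same frequent 3-itemset {O, K, E} and the same three strong rules.

```python
# Re-uses the apriori() and generate_rules() sketch shown earlier.
D = [{"M", "O", "N", "K", "E", "Y"},
     {"D", "O", "N", "K", "E", "Y"},
     {"M", "A", "K", "E"},
     {"M", "U", "C", "K", "Y"},
     {"C", "O", "K", "I", "E"}]                  # the duplicate O in T500 collapses in a set

freq = apriori(D, min_sup=3)                     # 60% of 5 transactions = 3
print([set(f) for f in freq if len(f) == 3])     # [{'O', 'K', 'E'}]
for s, c, conf in generate_rules(freq, min_conf=0.8):
    if len(s | c) == 3:                          # only the rules derived from {O, K, E}
        print(sorted(s), "=>", sorted(c), f"conf={conf:.0%}")
# Expected: ['O'] => ['E', 'K'], ['K', 'O'] => ['E'], ['E', 'O'] => ['K'], each at 100%
```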
These limitations can be overcome by using the FP-growth algorithm.

Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
- Partitioning: an itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Sampling: mine on a subset of the given data with a lower support threshold, together with a method to determine completeness.
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.

Frequent Pattern (FP) Growth Algorithm
This algorithm is an improvement over the Apriori method. Frequent patterns are generated without the need for candidate generation. The FP-growth algorithm represents the database in the form of a tree called a frequent pattern tree (FP-tree). This tree structure maintains the associations between the itemsets. FP-growth has two main steps.

Step 1: Construction of the FP-tree
The tree is built using two passes over the data set.
Pass 1:
- Scan the data and find the support of each item.
- Discard infrequent items.
- Sort the frequent items in decreasing order of their support.
Pass 2:
- Create the root of the tree, labelled "null".
- Scan the database again. Process the items of each transaction in descending order of support count and create a branch for each transaction.
- If a transaction shares a common prefix with already processed transactions, increment the counter of each item in the common prefix and create branches only for the items that are not common.
- To facilitate tree traversal, an item header table is also built so that each item points to its occurrences in the tree via a chain of node links.

Step 2: Extract frequent itemsets from the FP-tree
- For each item, construct its conditional pattern base, which is the subset of the database given by the set of prefix paths in the FP-tree ending at that item.
- Construct its (conditional) FP-tree. The conditional FP-tree of an item I is the FP-tree built by considering only the transactions that contain I and then removing I from all of them. Mining is then performed recursively on this tree.
- The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from its conditional FP-tree.

Example

    TID | Items
    T1  | B, A, T
    T2  | A, C
    T3  | A, S
    T4  | B, A, C
    T5  | B, S
    T6  | A, S
    T7  | B, S
    T8  | B, A, S, T
    T9  | B, A, S

Solution:
First, count each item over the whole transactional database and sort the item list in descending order of support count:

    Item | Sup_count        Item (sorted) | Sup_count
    B    | 6                A             | 7
    A    | 7                B             | 6
    T    | 2                S             | 6
    C    | 2                C             | 2
    S    | 6                T             | 2

Now, for each transaction, the respective ordered itemset is built:

    TID | Items        | Ordered itemset
    T1  | B, A, T      | A, B, T
    T2  | A, C         | A, C
    T3  | A, S         | A, S
    T4  | B, A, C      | A, B, C
    T5  | B, S         | B, S
    T6  | A, S         | A, S
    T7  | B, S         | B, S
    T8  | B, A, S, T   | A, B, S, T
    T9  | B, A, S      | A, B, S

The FP-tree is then grown by inserting these ordered itemsets one transaction at a time; the original note illustrates the tree after each insertion (T1 through T9) with a sequence of figures, ending with the complete FP-tree.
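Step 1 can be sketched in a few lines of Python. The class and function names below are illustrative (not from the note); the sketch builds the FP-tree and its header table for the nine transactions above, using the same item ordering A, B, S, C, T.

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                    # item -> child FPNode

def build_fp_tree(transactions, min_sup):
    """Pass 1: count item supports. Pass 2: insert each transaction in descending support order."""
    support = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in support.items() if c >= min_sup}
    root = FPNode(None, None)                 # the tree is rooted at a "null" node
    header = defaultdict(list)                # item -> chain of node links
    for t in transactions:
        # Keep only frequent items, ordered by global support (ties broken alphabetically).
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-support[i], i))
        node = root
        for item in ordered:
            if item not in node.children:     # new branch
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]        # shared prefix: just descend ...
            node.count += 1                   # ... and increment the counter
    return root, header

transactions = [
    ["B", "A", "T"], ["A", "C"], ["A", "S"], ["B", "A", "C"], ["B", "S"],
    ["A", "S"], ["B", "S"], ["B", "A", "S", "T"], ["B", "A", "S"],
]
root, header = build_fp_tree(transactions, min_sup=2)
print(sum(node.count for node in header["S"]))   # 6, the support count of S
```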
Now, to facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node links (the original note shows this header table, with Item, Count and Node Link columns, next to the final FP-tree).

Next, a table of conditional pattern bases and conditional FP-trees is constructed, from which the frequent patterns are generated (min_sup = 2):

    Item | Conditional pattern base       | Conditional FP-tree   | Frequent patterns generated
    T    | {{A,B: 1}, {A,B,S: 1}}         | <A: 2, B: 2>          | {A,T: 2}, {B,T: 2}, {A,B,T: 2}
    C    | {{A,B: 1}, {A: 1}}             | <A: 2>                | {A,C: 2}
    S    | {{A,B: 2}, {A: 2}, {B: 2}}     | <A: 4, B: 2>, <B: 2>  | {A,S: 4}, {B,S: 4}, {A,B,S: 2}
    B    | {{A: 4}}                       | <A: 4>                | {A,B: 4}

A hangs directly off the null (root) node, and since there are no in-between nodes on the way to A, no separate row is needed for A.

We have generated the 3-item frequent sets {A, B, T: 2} and {A, B, S: 2}. Similarly, the 2-item frequent sets are {A, T: 2}, {B, T: 2}, {A, C: 2}, {A, S: 4}, {B, S: 4} and {A, B: 4}.

Generating association rules: the same procedure as in the Apriori algorithm is applied to these frequent itemsets.

Advantages of the FP-growth algorithm
- It allows frequent itemset discovery without candidate generation.
- It builds a compact data structure called the FP-tree.
- It needs only two passes over the data set.
- It is faster than the Apriori algorithm.

Disadvantages of the FP-growth algorithm
- The FP-tree is more difficult to build than Apriori's candidate sets.
- The FP-tree may not fit in memory.
- The FP-tree is expensive to build.

Apriori vs FP-Growth

    Apriori                                                                     | FP-Growth
    It is slower than the FP-growth algorithm.                                  | It is faster than the Apriori algorithm.
    It is an array-based algorithm.                                             | It is a tree-based algorithm.
    It uses the Apriori join and prune properties.                              | It constructs a conditional frequent pattern tree and a conditional pattern base from the database, which satisfy the minimum support.
    It uses a level-wise approach: it generates patterns of 1 item, then 2 items, then 3 items, and so on. | It uses a pattern-growth approach: it only considers patterns actually existing in the database.
    It requires large memory space due to the large number of candidates generated. | It requires less memory space due to its compact structure and the absence of candidate generation.
    It scans the database multiple times to generate the candidate sets.        | It scans the database only twice to construct the frequent pattern tree.
    Candidate generation is very parallelizable.                                | The data are very interdependent; each node needs the root.
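Continuing the Step 1 sketch above, the conditional pattern base of an item can be read straight off the header table by walking each node link back toward the root. This only illustrates the first bullet of Step 2 (names again illustrative), not the full recursive mining.

```python
def conditional_pattern_base(header, item):
    """Prefix paths (with the counts of the item's nodes) that lead to `item` in the FP-tree."""
    paths = []
    for node in header[item]:                 # follow the node-link chain for this item
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)          # climb towards the root, collecting the prefix
            parent = parent.parent
        if path:
            paths.append((list(reversed(path)), node.count))
    return paths

# Using the `header` built by build_fp_tree() above:
print(conditional_pattern_base(header, "S"))
# [(['A'], 2), (['B'], 2), (['A', 'B'], 2)]  -- the conditional pattern base of S,
# matching the {A: 2}, {B: 2}, {A,B: 2} entries in the table above
```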
From Association Mining to Correlation Analysis (Lift)
Most association rule mining algorithms employ a support-confidence framework. Often, many interesting rules can be found using low support thresholds. Although minimum support and confidence thresholds help weed out or exclude the exploration of a good number of uninteresting rules, many rules so generated are still not interesting to the users. The support and confidence measures alone are insufficient for filtering out uninteresting association rules. To tackle this weakness, a correlation measure can be used to augment the support-confidence framework for association rules. This leads to correlation rules of the form

    A ⇒ B [support, confidence, correlation]

That is, a correlation rule is measured not only by its support and confidence but also by the correlation between the itemsets A and B. There are many different correlation measures to choose from; lift is a simple one.

Lift: The lift between the occurrences of A and B is measured as

    lift(A, B) = P(A ∪ B) / (P(A) P(B))

- If lift(A, B) is less than 1, then the occurrence of A is negatively correlated with the occurrence of B.
- If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.
- If the resulting value is equal to 1, then A and B are independent and there is no correlation between them.

Please let me know if I missed anything or anything is incorrect. [email protected]

Collegenote - Prepared by: Jayanta Poudel
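To make the measure concrete, the sketch below (function name illustrative, not from the note) estimates lift for bread and milk from the same five transactions used in the Frequent Itemsets section; the value just below 1 indicates a weak negative correlation in that tiny data set.

```python
def lift(transactions, A, B):
    """lift(A, B) = P(A u B) / (P(A) * P(B)), estimated from a list of transactions."""
    A, B = set(A), set(B)
    n = len(transactions)
    p_a = sum(A <= t for t in transactions) / n
    p_b = sum(B <= t for t in transactions) / n
    p_ab = sum((A | B) <= t for t in transactions) / n
    return p_ab / (p_a * p_b)

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
print(round(lift(transactions, {"Bread"}, {"Milk"}), 3))
# 0.938 -> less than 1, so Bread and Milk are (slightly) negatively correlated here
```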
