
Data Mining: Association Rule Algorithms

Mining Association Rules in Large Databases: finding association rules between sets of items in large transaction databases
Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality.

Definition: Frequent Itemset

- Itemset: a collection of one or more items, e.g. {Milk, Bread, Diaper}.
- k-itemset: an itemset that contains k items.
- Support count (σ): frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2 in the transactions above.
- Support (s): fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5.
- Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

(We assume that itemsets are ordered lexicographically.)

Definition: Association Rule

Let D be a database of transactions, e.g.:

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

- Let I be the set of items that appear in the database, e.g. I = {A, B, C, D, E, F}.
- A rule is defined by X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
- E.g. {B, C} → {E} is a rule.

Definition: Association Rule

- Association rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}.
- Rule evaluation metrics:
  - Support (s): fraction of transactions that contain both X and Y.
  - Confidence (c): measures how often items in Y appear in transactions that contain X.

Example, for {Milk, Diaper} → {Beer} on the market-basket transactions:
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

Rule Measures: Support and Confidence

Find all the rules X → Y with minimum support and confidence:
- Support, s: probability that a transaction contains X ∪ Y.
- Confidence, c: conditional probability that a transaction containing X also contains Y.

(Venn-diagram illustration: customers who buy diapers, customers who buy beer, and customers who buy both.)

With minimum support 50% and minimum confidence 50% on the transaction database above (2000: A,B,C; 1000: A,C; 4000: A,D; 5000: B,E,F), we have:
- A → C (support 50%, confidence 66.6%)
- C → A (support 50%, confidence 100%)

Example

TID  date      items_bought
100  10/10/99  F, A, D, B
200  15/10/99  D, A, C, E, B
300  19/10/99  C, A, B, E
400  20/10/99  B, A, D

Remember: conf(X → Y) = sup(X ∪ Y) / sup(X)

What is the support and confidence of the rule {B, D} → {A}?
- Support: percentage of tuples that contain {A, B, D} = 75%
- Confidence: (number of tuples that contain {A, B, D}) / (number of tuples that contain {B, D}) = 100%

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having:
- support ≥ minsup threshold
- confidence ≥ minconf threshold

Brute-force approach:
- List all possible association rules.
- Compute the support and confidence for each rule.
- Prune rules that fail the minsup and minconf thresholds.
- This is computationally prohibitive.
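As a quick check of these definitions, here is a minimal Python sketch (not part of the original slides; the transaction list and helper names are our own) that evaluates the rule {Milk, Diaper} → {Beer} on the market-basket transactions used throughout:

```python
# Computing support and confidence for a rule X -> Y over the
# market-basket transactions from the slides above.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X -> Y) = sup(X u Y) / sup(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))   # 0.666...
```

Running it prints 0.4 and 0.666…, matching the values s = 0.4 and c = 0.67 computed on the slide.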
Mining Association Rules

Example rules derived from the market-basket transactions:
{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}.
- Rules originating from the same itemset have identical support but can have different confidence.
- Thus, we may decouple the support and confidence requirements.

Mining Association Rules

Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.

Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation

(Itemset lattice over the items A, B, C, D, E: the null set, the single items, the pairs AB through DE, and so on up to ABCDE.) Given d items, there are 2^d possible candidate itemsets.

Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset.
- Count the support of each candidate by scanning the database: match each of the N transactions (of maximum width w) against every one of the M candidates.
- Complexity O(NMw); expensive since M = 2^d.

Computational Complexity

Given d unique items:
- Total number of itemsets = 2^d.
- Total number of possible association rules:
  R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1
  If d = 6, R = 602 rules.

Frequent Itemset Generation Strategies

- Reduce the number of candidates (M): complete search has M = 2^d; use pruning techniques to reduce M.
- Reduce the number of transactions (N): reduce the size of N as the size of the itemset increases; used by DHP and vertical-based mining algorithms.
- Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions; no need to match every candidate against every transaction.

Reducing Number of Candidates

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.

The Apriori principle holds due to the following property of the support measure:

∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

- The support of an itemset never exceeds the support of its subsets.
- This is known as the anti-monotone property of support.

Example (on the market-basket transactions):
s(Bread) ≥ s(Bread, Beer)
s(Milk) ≥ s(Bread, Milk)
s(Diaper, Beer) ≥ s(Diaper, Beer, Coke)

Illustrating Apriori Principle

(Itemset lattice over A, B, C, D, E: once the itemset AB is found to be infrequent, all of its supersets — ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE — are pruned without counting their support.)

Illustrating Apriori Principle

Minimum support count = 3.

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset        Count
Bread, Milk    3
Bread, Beer    2
Bread, Diaper  3
Milk, Beer     2
Milk, Diaper   3
Beer, Diaper   3

Triplets (3-itemsets):
Itemset              Count
Bread, Milk, Diaper  3

If every subset is considered, C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates; with support-based pruning, only 6 + 6 + 1 = 13.
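To make the prune step concrete, here is a small Python sketch (our own illustration; `has_infrequent_subset` is a hypothetical helper, not from the slides) that rejects a candidate (k+1)-itemset as soon as one of its k-subsets is infrequent, using the frequent 2-itemsets found above with a minimum support count of 3:

```python
# Anti-monotone pruning: a candidate (k+1)-itemset can be discarded
# if any of its k-subsets is not frequent.
from itertools import combinations

def has_infrequent_subset(candidate, frequent_k_itemsets):
    """Return True if any k-subset of `candidate` is not frequent."""
    k = len(candidate) - 1
    return any(frozenset(sub) not in frequent_k_itemsets
               for sub in combinations(candidate, k))

# Frequent 2-itemsets of the market-basket data with minimum support count 3.
frequent_2 = {frozenset(s) for s in
              [{"Bread", "Milk"}, {"Bread", "Diaper"},
               {"Milk", "Diaper"}, {"Beer", "Diaper"}]}

# {Bread, Milk, Diaper} keeps all of its 2-subsets frequent, so it stays a candidate;
# {Milk, Diaper, Beer} is pruned because {Milk, Beer} is infrequent.
print(has_infrequent_subset({"Bread", "Milk", "Diaper"}, frequent_2))  # False
print(has_infrequent_subset({"Milk", "Diaper", "Beer"}, frequent_2))   # True
```

This is exactly why only one candidate triplet, {Bread, Milk, Diaper}, needs to be counted in the example above.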
The Apriori Algorithm (the general idea)

1. Find the frequent 1-itemsets and put them into L_1 (k = 1).
2. Use L_k to generate a collection of candidate itemsets C_{k+1} of size k+1.
3. Scan the database to find which itemsets in C_{k+1} are frequent and put them into L_{k+1}.
4. If L_{k+1} is not empty, set k = k+1 and go to step 2.

R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Data Bases (VLDB), Santiago, Chile, Sept. 1994.

The Apriori Algorithm

Pseudo-code (C_k: candidate itemsets of size k; L_k: frequent itemsets of size k):

L_1 = {frequent items};
for (k = 1; L_k ≠ ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;   // join and prune steps
    for each transaction t in the database do
        increment the count of all candidates in C_{k+1} that are contained in t
    L_{k+1} = candidates in C_{k+1} with min_support (frequent)
end
return ∪_k L_k;

Important steps in candidate generation:
- Join step: C_{k+1} is generated by joining L_k with itself.
- Prune step: any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset.
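For reference, the following is a compact, runnable Python sketch of the loop described by the pseudo-code above (an illustrative implementation under our own naming, not the authors' original code): the join step unions pairs of frequent k-itemsets, the prune step drops candidates that have an infrequent k-subset, and one database scan per level counts the surviving candidates.

```python
# A minimal Apriori sketch: L_1 -> C_2 -> L_2 -> ... until no level is frequent.
from itertools import combinations
from collections import defaultdict

def apriori(transactions, min_support_count):
    transactions = [frozenset(t) for t in transactions]

    # L_1: frequent 1-itemsets.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    L = {s for s, c in counts.items() if c >= min_support_count}

    frequent = {}
    k = 1
    while L:
        frequent.update({s: counts[s] for s in L})
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune step: discard candidates with an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in L for sub in combinations(c, k))}
        # Scan the database once to count the remaining candidates.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {c for c in candidates if counts[c] >= min_support_count}
        k += 1
    return frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
for itemset, count in sorted(apriori(transactions, 3).items(),
                             key=lambda kv: len(kv[0])):
    print(sorted(itemset), count)
```

On the five market-basket transactions with a minimum support count of 3, this returns the four frequent items, the four frequent pairs, and the single frequent triplet {Bread, Milk, Diaper}, matching the worked example above.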