Data Mining Association Analysis

Data Mining Association Analysis: Basic Concepts and Algorithms
(© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004)

Association Rule Mining

- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID | Items
 1  | Bread, Milk
 2  | Bread, Diaper, Beer, Eggs
 3  | Milk, Diaper, Beer, Coke
 4  | Bread, Milk, Diaper, Beer
 5  | Bread, Milk, Diaper, Coke

Example association rules:

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication here means co-occurrence, not causality.

Definition: Frequent Itemset

- Itemset: a collection of one or more items, e.g. {Milk, Bread, Diaper}.
  - k-itemset: an itemset that contains k items.
- Support count (σ): frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2 in the transactions above.
- Support (s): fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5.
- Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

Definition: Association Rule

- Association rule: an implication expression of the form X → Y, where X and Y are itemsets, e.g. {Milk, Diaper} → {Beer}.
- Rule evaluation metrics:
  - Support (s): fraction of transactions that contain both X and Y.
  - Confidence (c): how often items in Y appear in transactions that contain X.
- Example, for {Milk, Diaper} → {Beer}:

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67

Association Rule Mining Task

- Given a set of transactions T, the goal of association rule mining is to find all rules having
  - support ≥ minsup threshold
  - confidence ≥ minconf threshold
- Brute-force approach:
  - List all possible association rules.
  - Compute the support and confidence for each rule.
  - Prune rules that fail the minsup and minconf thresholds.
  - Computationally prohibitive!

Mining Association Rules

Example rules, all derived from the itemset {Milk, Diaper, Beer}:

{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset, {Milk, Diaper, Beer}.
- Rules originating from the same itemset have identical support but can have different confidence.
- Thus, we may decouple the support and confidence requirements.

Mining Association Rules

- Two-step approach:
  1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
  2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
- Frequent itemset generation is still computationally expensive.
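Before turning to frequent itemset generation, the metrics above can be made concrete in a few lines of code. The following Python sketch is an illustration, not code from the slides; the helper names support_count and rule_metrics are ours. It reproduces the example rule and then brute-forces all binary partitions of {Milk, Diaper, Beer}.

```python
from itertools import combinations

# The five market-basket transactions from the slides
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(itemset): number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y):
    """Support and confidence of the rule X -> Y."""
    s = support_count(X | Y) / len(transactions)
    c = support_count(X | Y) / support_count(X)
    return s, c

# Reproduce the slides' example: {Milk, Diaper} -> {Beer}
s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"})
print(f"s={s:.1f}, c={c:.2f}")  # s=0.4, c=0.67

# Brute force: every binary partition of the itemset {Milk, Diaper, Beer}
itemset = {"Milk", "Diaper", "Beer"}
for k in range(1, len(itemset)):
    for X in combinations(sorted(itemset), k):
        X = set(X)
        Y = itemset - X
        s, c = rule_metrics(X, Y)
        print(f"{X} -> {Y}: s={s:.1f}, c={c:.2f}")
```

The output confirms the observation above: all six rules share s = 0.4, while confidence ranges from 0.5 to 1.0.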
Frequent Itemset Generation

[Figure: the itemset lattice over items A-E, from the empty set (null) at the top, through all 1-, 2-, 3-, and 4-itemsets, down to ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation

- Brute-force approach:
  - Each itemset in the lattice is a candidate frequent itemset.
  - Count the support of each candidate by scanning the database: match each of the N transactions, of maximum width w, against every one of the M candidates.
  - Complexity O(NMw). Expensive, since M = 2^d.

Computational Complexity

- Given d unique items:
  - Total number of itemsets = 2^d.
  - Total number of possible association rules:

    R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1

  If d = 6, R = 3^6 − 2^7 + 1 = 729 − 128 + 1 = 602 rules.

Frequent Itemset Generation Strategies

- Reduce the number of candidates (M):
  - Complete search: M = 2^d.
  - Use pruning techniques to reduce M.
- Reduce the number of transactions (N):
  - Reduce the size of N as the size of the itemset increases.
  - Used by DHP and vertical-based mining algorithms.
- Reduce the number of comparisons (NM):
  - Use efficient data structures to store the candidates or transactions.
  - No need to match every candidate against every transaction.

Reducing Number of Candidates

- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- The Apriori principle holds due to the following property of the support measure:

  ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  - The support of an itemset never exceeds the support of its subsets.
  - This is known as the anti-monotone property of support.

Illustrating Apriori Principle

[Figure: the same itemset lattice; once AB is found to be infrequent, all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned without ever being counted.]

Illustrating Apriori Principle

Items (1-itemsets), minimum support = 3:

Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:

Itemset         | Count
{Bread, Milk}   | 3
{Bread, Beer}   | 2
{Bread, Diaper} | 3
{Milk, Beer}    | 2
{Milk, Diaper}  | 3
{Beer, Diaper}  | 3

Triplets (3-itemsets):

Itemset               | Count
{Bread, Milk, Diaper} | 3

If every subset is considered, C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates. With support-based pruning, 6 + 6 + 1 = 13.

Apriori Algorithm

- Method (a code sketch follows below):
  - Let k = 1.
  - Generate frequent itemsets of length 1.
  - Repeat until no new frequent itemsets are identified:
    - Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
    - Prune candidate itemsets containing subsets of length k that are infrequent.
    - Count the support of each candidate by scanning the DB.
    - Eliminate candidates that are infrequent, leaving only those that are frequent.
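The method above maps almost line-for-line onto code. The following is a minimal Python sketch of Apriori-style frequent itemset generation, not the slides' reference implementation: it assumes the common prefix-merge scheme for candidate generation (two frequent k-itemsets sharing their first k−1 items are merged), which the slides do not prescribe.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {frozenset: support_count} for all frequent itemsets."""
    # Frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Merge frequent k-itemsets that share their first k-1 items
        prev = sorted(sorted(s) for s in frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:-1] == prev[j][:-1]:
                    cand = frozenset(prev[i]) | frozenset(prev[j])
                    # Prune: every k-subset must itself be frequent
                    if all(frozenset(sub) in frequent
                           for sub in combinations(cand, k)):
                        candidates.add(cand)
        # Count support of surviving candidates with one DB scan
        counts = {cand: 0 for cand in candidates}
        for t in transactions:
            for cand in candidates:
                if cand <= t:
                    counts[cand] += 1
        # Eliminate infrequent candidates
        frequent = {s: c for s, c in counts.items() if c >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

Called as apriori(transactions, minsup_count=3) on the five transactions from the first sketch, this reproduces the numbers above: four frequent items, four frequent pairs, and the single frequent triplet {Bread, Milk, Diaper}.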
Reducing Number of Comparisons

- Candidate counting:
  - Scan the database of transactions to determine the support of each candidate itemset.
  - To reduce the number of comparisons, store the candidates in a hash structure.
  - Instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets.

[Figure: the N transactions are matched against a hash structure whose buckets hold the candidate k-itemsets.]

Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3:

{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:
- A hash function.
- A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets in a leaf exceeds the max leaf size, split the node).

[Figure: the candidate hash tree built with the hash function that maps items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 to three branches (i.e. h(item) = item mod 3), with the 15 candidates distributed over its leaves.]

Association Rule Discovery: Hash Tree

[Figure, shown in three steps on the slides: descending the candidate hash tree by hashing on 1, 4 or 7, then on 2, 5 or 8, then on 3, 6 or 9 at successive levels, so that a transaction is matched only against the candidates in the leaves it reaches.]
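A hash tree of this kind can be sketched in a few dozen lines of Python. The version below is an illustration, not the slides' reference implementation: it assumes h(item) = item mod 3 (matching the 1,4,7 / 2,5,8 / 3,6,9 buckets) and a max leaf size of 3, and shows how a transaction is probed against only the leaves it can reach.

```python
class Node:
    """Hash tree node: interior nodes route by h(item); leaves hold itemsets."""
    def __init__(self):
        self.children = {}   # hash value -> Node (interior node)
        self.itemsets = []   # candidate itemsets (leaf node)
        self.is_leaf = True

MAX_LEAF_SIZE = 3            # assumed; chosen to match the slides' tree

def h(item):
    return item % 3          # buckets {1,4,7}, {2,5,8}, {3,6,9} as on the slides

def insert(node, itemset, depth=0):
    """Insert a sorted candidate itemset, splitting leaves that overflow."""
    if not node.is_leaf:
        child = node.children.setdefault(h(itemset[depth]), Node())
        insert(child, itemset, depth + 1)
        return
    node.itemsets.append(itemset)
    if len(node.itemsets) > MAX_LEAF_SIZE and depth < len(itemset):
        node.is_leaf = False            # split: redistribute by the next item
        for it in node.itemsets:
            child = node.children.setdefault(h(it[depth]), Node())
            insert(child, it, depth + 1)
        node.itemsets = []

def reached_leaves(node, transaction, start=0, out=None):
    """Collect the distinct leaves a sorted transaction can reach by
    hashing on every remaining item at each interior node."""
    if out is None:
        out = {}
    if node.is_leaf:
        out[id(node)] = node
    else:
        for i in range(start, len(transaction)):
            child = node.children.get(h(transaction[i]))
            if child is not None:
                reached_leaves(child, transaction, i + 1, out)
    return out

# The 15 candidate 3-itemsets from the slides
candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
root = Node()
for c in candidates:
    insert(root, c)

# A transaction is compared only against candidates in the leaves it reaches
t = (1, 2, 3, 5, 6)
checked = [c for leaf in reached_leaves(root, t).values() for c in leaf.itemsets]
hits = [c for c in checked if set(c) <= set(t)]
print(len(checked), "of", len(candidates), "candidates checked; contained:", hits)
```

With these assumptions the sketch reproduces the slides' leaf layout (e.g. the leaves {2 3 4, 5 6 7} and {1 2 4, 4 5 7}), and probing the transaction (1, 2, 3, 5, 6) checks 11 of the 15 candidates instead of all of them, finding the three that are actually contained: (1, 2, 5), (1, 3, 6), and (3, 5, 6).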