Associative classification in data mining ppt

classification and prediction in data mining ppt and classification of data mining systems ppt
Dr.JakeFinlay Profile Pic
Dr.JakeFinlay,Germany,Teacher
Published Date:22-07-2017
Your Website URL(Optional)
Comment
Classification July 20, 2017 www.ThesisScientist.com 1 1Sequence Data Timeline 10 15 20 25 30 35 Sequence Database: Object A: Object Timestamp Events 2 6 1 A 10 2, 3, 5 1 3 A 20 6, 1 5 A 23 1 B 11 4, 5, 6 B 17 2 Object B: B 21 7, 8, 1, 2 1 4 2 7 B 28 1, 6 6 5 8 6 1 C 14 1, 8, 7 2 Object C: 1 7 8 2 July 20, 2017 www.ThesisScientist.comExamples of Sequence Data Sequence Sequence Element Event Database (Transaction) (Item) Customer Purchase history of a given A set of items bought by Books, diary products, customer a customer at time t CDs, etc Web Data Browsing activity of a A collection of files Home page, index particular Web visitor viewed by a Web visitor page, contact info, etc after a single mouse click Event data History of events generated Events triggered by a Types of alarms by a given sensor sensor at time t generated by sensors Genome DNA sequence of a An element of the DNA Bases A,T,G,C sequences particular species sequence Element Event (Transaction) E1 E1 E3 (Item) E2 E2 E2 E3 E4 Sequence 3 July 20, 2017 www.ThesisScientist.comFormal Definition of a Sequence  A sequence is an ordered list of elements (transactions) s = e e e … 1 2 3  Each element contains a collection of events (items) e = i , i , …, i i 1 2 k  Each element is attributed to a specific time or location  Length of a sequence, s, is given by the number of elements of the sequence  A k-sequence is a sequence that contains k events (items) 4 July 20, 2017 www.ThesisScientist.comFormal Definition of a Subsequence  A sequence a a … a is contained in another sequence 1 2 n b b … b (m ≥ n) if there exist integers 1 2 m i i … i such that a b , a b , …, a b 1 2 n 1 i1 2 i1 n in Data sequence Subsequence Contain? 2,4 3,5,6 8 2 3,5 Yes 1,2 3,4 1 2 No 2,4 2,4 2,5 2 4 Yes  The support of a subsequence w is defined as the fraction of data sequences that contain w  A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup) 5 July 20, 2017 www.ThesisScientist.comSequential Pattern Mining: Definition  Given:  a database of sequences  a user-specified minimum support threshold, minsup  Task:  Find all subsequences with support ≥ minsup 6 July 20, 2017 www.ThesisScientist.comSequential Pattern Mining: Challenge  Given a sequence: a b c d e f g h i  Examples of subsequences: a c d f g , c d e , b g , etc.  How many k-subsequences can be extracted from a given n-sequence? a b c d e f g h i n = 9 k=4: Y _ _ Y Y _ _ _Y Answer : n 9  a d e i  126  k 4  7 July 20, 2017 www.ThesisScientist.comChallenges on Sequential Pattern Mining  A huge number of possible sequential patterns are hidden in databases  A mining algorithm should  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold  be highly efficient, scalable, involving only a small number of database scans  be able to incorporate various kinds of user-specific constraints 8 July 20, 2017 www.ThesisScientist.comSequential Pattern Mining Algorithms  Concept introduction and an initial Apriori-like algorithm  Agrawal & Srikant. Mining sequential patterns, ICDE’95  Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal EDBT’96)  Pattern-growth methods: FreeSpan & PrefixSpan (Han et al.KDD’00; Pei, et al.ICDE’01)  Vertical format-based mining: SPADE (ZakiMachine Leanining’00)  Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, ShimVLDB’99; Pei, Han, Wang CIKM’02)  Mining closed sequential patterns: CloSpan (Yan, Han & Afshar SDM’03) 9 July 20, 2017 www.ThesisScientist.comExtracting Sequential Patterns  Given n events: i , i , i , …, i 1 2 3 n  Candidate 1-subsequences: i , i , i , …, i 1 2 3 n  Candidate 2-subsequences: i , i , i , i , …, i i , i i , …, 1 2 1 3 1 1 1 2 i i n-1 n  Candidate 3-subsequences: i , i , i , i , i , i , …, i , i i , i , i 1 2 3 1 2 4 1 2 1 1 2 i , …, 2 i i , i , i i , i , …, i i i , 1 1 2 1 1 3 1 1 1 i i i , … 1 1 2 10 July 20, 2017 www.ThesisScientist.comGeneralized Sequential Pattern (GSP)  Step 1:  Make the first pass over the sequence database D to yield all the 1- element frequent sequences  Step 2: Repeat until no new frequent sequences are found  Candidate Generation:  Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items  Candidate Pruning:  Prune candidate k-sequences that contain infrequent (k-1)- subsequences  Support Counting:  Make a new pass over the sequence database D to find the support for these candidate sequences  Candidate Elimination:  Eliminate candidate k-sequences whose actual support is less than minsup 11 July 20, 2017 www.ThesisScientist.comCandidate Generation  Base case (k=2):  Merging two frequent 1-sequences i and i will produce 1 2 two candidate 2-sequences: i i and i i 1 2 1 2  General case (k2):  A frequent (k-1)-sequence w is merged with another frequent 1 (k-1)-sequence w to produce a candidate k-sequence if the 2 subsequence obtained by removing the first event in w is the same 1 as the subsequence obtained by removing the last event in w 2  The resulting candidate after merging is given by the sequence w extended with the last event of w . 1 2  If the last two events in w belong to the same element, then the 2 last event in w becomes part of the last element in w 2 1  Otherwise, the last event in w becomes a separate element 2 appended to the end of w 1 12 July 20, 2017 www.ThesisScientist.comCandidate Generation Examples  Merging the sequences w =1 2 3 4 and w =2 3 4 5 1 2 will produce the candidate sequence 1 2 3 4 5 because the last two events in w (4 and 5) belong to the same element 2  Merging the sequences w =1 2 3 4 and w =2 3 4 5 1 2 will produce the candidate sequence 1 2 3 4 5 because the last two events in w (4 and 5) do not belong to the same element 2  We do not have to merge the sequences w =1 2 6 4 and w =1 2 4 5 1 2 to produce the candidate 1 2 6 4 5 because if the latter is a viable candidate, then it can be obtained by merging w with 1 1 2 6 5 13 July 20, 2017 www.ThesisScientist.comGSP Example Frequent 3-sequences Candidate 1 2 3 Generation 1 2 5 1 5 3 Candidate 1 2 3 4 2 3 4 Pruning 1 2 5 3 2 5 3 1 5 3 4 3 4 5 2 3 4 5 5 3 4 1 2 5 3 2 5 3 4 14 July 20, 2017 www.ThesisScientist.comFinding Length-1 Sequential Patterns  Examine GSP using an example  Initial candidates: all singleton sequences Cand Sup  a, b, c, d, e, f, a 3 g, h b 5  Scan database once, count support for c 4 candidates d 3 e 3 min_sup =2 f 2 Seq. ID Sequence g 1 10 (bd)cb(ac) h 1 20 (bf)(ce)b(fg) 30 (ah)(bf)abf 40 (be)(ce)d 50 a(bd)bcb(ade) 15 July 20, 2017 www.ThesisScientist.comGSP: Generating Length-2 Candidates a b c d e f a aa ab ac ad ae af b ba bb bc bd be bf 51 length-2 c ca cb cc cd ce cf d da db dc dd de df Candidates e ea eb ec ed ee ef f fa fb fc fd fe ff a b c d e f Without Apriori a (ab) (ac) (ad) (ae) (af) b (bc) (bd) (be) (bf) property, c (cd) (ce) (cf) 88+87/2=92 d (de) (df) candidates e (ef) f Apriori prunes 44.57% candidates 16 July 20, 2017 www.ThesisScientist.comThe GSP Mining Process th Cand. cannot pass 5 scan: 1 cand. 1 length-5 seq. (bd)cba sup. threshold pat. th Cand. not in DB at all abba (bd)bc … 4 scan: 8 cand. 6 length-4 seq. pat. rd 3 scan: 47 cand. 19 length-3 seq. abb aab aba baa bab … pat. 20 cand. not in DB at all nd 2 scan: 51 cand. 19 length-2 seq. aa ab … af ba bb … ff (ab) … (ef) pat. 10 cand. not in DB at all st 1 scan: 8 cand. 6 length-1 seq. a b c d e f g h pat. Seq. ID Sequence 10 (bd)cb(ac) min_sup =2 20 (bf)(ce)b(fg) 30 (ah)(bf)abf 40 (be)(ce)d 50 a(bd)bcb(ade) 17 July 20, 2017 www.ThesisScientist.comCandidate Generate-and-test: Drawbacks  A huge set of candidate sequences generated.  Especially 2-item candidate sequence.  Multiple Scans of database needed.  The length of each candidate grows by one at each database scan.  Inefficient for mining long sequential patterns.  A long pattern grow up from short patterns  The number of short patterns is exponential to the length of mined patterns. 18 July 20, 2017 www.ThesisScientist.comThe SPADE Algorithm  SPADE (Sequential PAttern Discovery using Equivalent Class) developed by Zaki 2001  A vertical format sequential pattern mining method  A sequence database is mapped to a large set of  Item: SID, EID  Sequential pattern mining is performed by  growing the subsequences (patterns) one item at a time by Apriori candidate generation 19 July 20, 2017 www.ThesisScientist.comThe SPADE Algorithm 20 July 20, 2017 www.ThesisScientist.com