Question? Leave a message!




Classification in data mining

Classification in data mining
Classification July 20, 2017 www.ThesisScientist.com 1 1Sequence Data Timeline 10 15 20 25 30 35 Sequence Database: Object A: Object Timestamp Events 2 6 1 A 10 2, 3, 5 1 3 A 20 6, 1 5 A 23 1 B 11 4, 5, 6 B 17 2 Object B: B 21 7, 8, 1, 2 1 4 2 7 B 28 1, 6 6 5 8 6 1 C 14 1, 8, 7 2 Object C: 1 7 8 2 July 20, 2017 www.ThesisScientist.comExamples of Sequence Data Sequence Sequence Element Event Database (Transaction) (Item) Customer Purchase history of a given A set of items bought by Books, diary products, customer a customer at time t CDs, etc Web Data Browsing activity of a A collection of files Home page, index particular Web visitor viewed by a Web visitor page, contact info, etc after a single mouse click Event data History of events generated Events triggered by a Types of alarms by a given sensor sensor at time t generated by sensors Genome DNA sequence of a An element of the DNA Bases A,T,G,C sequences particular species sequence Element Event (Transaction) E1 E1 E3 (Item) E2 E2 E2 E3 E4 Sequence 3 July 20, 2017 www.ThesisScientist.comFormal Definition of a Sequence  A sequence is an ordered list of elements (transactions) s = e e e … 1 2 3  Each element contains a collection of events (items) e = i , i , …, i i 1 2 k  Each element is attributed to a specific time or location  Length of a sequence, s, is given by the number of elements of the sequence  A ksequence is a sequence that contains k events (items) 4 July 20, 2017 www.ThesisScientist.comFormal Definition of a Subsequence  A sequence a a … a is contained in another sequence 1 2 n b b … b (m ≥ n) if there exist integers 1 2 m i i … i such that a b , a b , …, a b 1 2 n 1 i1 2 i1 n in Data sequence Subsequence Contain 2,4 3,5,6 8 2 3,5 Yes 1,2 3,4 1 2 No 2,4 2,4 2,5 2 4 Yes  The support of a subsequence w is defined as the fraction of data sequences that contain w  A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup) 5 July 20, 2017 www.ThesisScientist.comSequential Pattern Mining: Definition  Given:  a database of sequences  a userspecified minimum support threshold, minsup  Task:  Find all subsequences with support ≥ minsup 6 July 20, 2017 www.ThesisScientist.comSequential Pattern Mining: Challenge  Given a sequence: a b c d e f g h i  Examples of subsequences: a c d f g , c d e , b g , etc.  How many ksubsequences can be extracted from a given nsequence a b c d e f g h i n = 9 k=4: Y Y Y Y Answer : n 9  a d e i  126  k 4  7 July 20, 2017 www.ThesisScientist.comChallenges on Sequential Pattern Mining  A huge number of possible sequential patterns are hidden in databases  A mining algorithm should  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold  be highly efficient, scalable, involving only a small number of database scans  be able to incorporate various kinds of userspecific constraints 8 July 20, 2017 www.ThesisScientist.comSequential Pattern Mining Algorithms  Concept introduction and an initial Apriorilike algorithm  Agrawal Srikant. Mining sequential patterns, ICDE’95  Aprioribased method: GSP (Generalized Sequential Patterns: Srikant Agrawal EDBT’96)  Patterngrowth methods: FreeSpan PrefixSpan (Han et al.KDD’00; Pei, et al.ICDE’01)  Vertical formatbased mining: SPADE (ZakiMachine Leanining’00)  Constraintbased sequential pattern mining (SPIRIT: Garofalakis, Rastogi, ShimVLDB’99; Pei, Han, Wang CIKM’02)  Mining closed sequential patterns: CloSpan (Yan, Han Afshar SDM’03) 9 July 20, 2017 www.ThesisScientist.comExtracting Sequential Patterns  Given n events: i , i , i , …, i 1 2 3 n  Candidate 1subsequences: i , i , i , …, i 1 2 3 n  Candidate 2subsequences: i , i , i , i , …, i i , i i , …, 1 2 1 3 1 1 1 2 i i n1 n  Candidate 3subsequences: i , i , i , i , i , i , …, i , i i , i , i 1 2 3 1 2 4 1 2 1 1 2 i , …, 2 i i , i , i i , i , …, i i i , 1 1 2 1 1 3 1 1 1 i i i , … 1 1 2 10 July 20, 2017 www.ThesisScientist.comGeneralized Sequential Pattern (GSP)  Step 1:  Make the first pass over the sequence database D to yield all the 1 element frequent sequences  Step 2: Repeat until no new frequent sequences are found  Candidate Generation:  Merge pairs of frequent subsequences found in the (k1)th pass to generate candidate sequences that contain k items  Candidate Pruning:  Prune candidate ksequences that contain infrequent (k1) subsequences  Support Counting:  Make a new pass over the sequence database D to find the support for these candidate sequences  Candidate Elimination:  Eliminate candidate ksequences whose actual support is less than minsup 11 July 20, 2017 www.ThesisScientist.comCandidate Generation  Base case (k=2):  Merging two frequent 1sequences i and i will produce 1 2 two candidate 2sequences: i i and i i 1 2 1 2  General case (k2):  A frequent (k1)sequence w is merged with another frequent 1 (k1)sequence w to produce a candidate ksequence if the 2 subsequence obtained by removing the first event in w is the same 1 as the subsequence obtained by removing the last event in w 2  The resulting candidate after merging is given by the sequence w extended with the last event of w . 1 2  If the last two events in w belong to the same element, then the 2 last event in w becomes part of the last element in w 2 1  Otherwise, the last event in w becomes a separate element 2 appended to the end of w 1 12 July 20, 2017 www.ThesisScientist.comCandidate Generation Examples  Merging the sequences w =1 2 3 4 and w =2 3 4 5 1 2 will produce the candidate sequence 1 2 3 4 5 because the last two events in w (4 and 5) belong to the same element 2  Merging the sequences w =1 2 3 4 and w =2 3 4 5 1 2 will produce the candidate sequence 1 2 3 4 5 because the last two events in w (4 and 5) do not belong to the same element 2  We do not have to merge the sequences w =1 2 6 4 and w =1 2 4 5 1 2 to produce the candidate 1 2 6 4 5 because if the latter is a viable candidate, then it can be obtained by merging w with 1 1 2 6 5 13 July 20, 2017 www.ThesisScientist.comGSP Example Frequent 3sequences Candidate 1 2 3 Generation 1 2 5 1 5 3 Candidate 1 2 3 4 2 3 4 Pruning 1 2 5 3 2 5 3 1 5 3 4 3 4 5 2 3 4 5 5 3 4 1 2 5 3 2 5 3 4 14 July 20, 2017 www.ThesisScientist.comFinding Length1 Sequential Patterns  Examine GSP using an example  Initial candidates: all singleton sequences Cand Sup  a, b, c, d, e, f, a 3 g, h b 5  Scan database once, count support for c 4 candidates d 3 e 3 minsup =2 f 2 Seq. ID Sequence g 1 10 (bd)cb(ac) h 1 20 (bf)(ce)b(fg) 30 (ah)(bf)abf 40 (be)(ce)d 50 a(bd)bcb(ade) 15 July 20, 2017 www.ThesisScientist.comGSP: Generating Length2 Candidates a b c d e f a aa ab ac ad ae af b ba bb bc bd be bf 51 length2 c ca cb cc cd ce cf d da db dc dd de df Candidates e ea eb ec ed ee ef f fa fb fc fd fe ff a b c d e f Without Apriori a (ab) (ac) (ad) (ae) (af) b (bc) (bd) (be) (bf) property, c (cd) (ce) (cf) 88+87/2=92 d (de) (df) candidates e (ef) f Apriori prunes 44.57 candidates 16 July 20, 2017 www.ThesisScientist.comThe GSP Mining Process th Cand. cannot pass 5 scan: 1 cand. 1 length5 seq. (bd)cba sup. threshold pat. th Cand. not in DB at all abba (bd)bc … 4 scan: 8 cand. 6 length4 seq. pat. rd 3 scan: 47 cand. 19 length3 seq. abb aab aba baa bab … pat. 20 cand. not in DB at all nd 2 scan: 51 cand. 19 length2 seq. aa ab … af ba bb … ff (ab) … (ef) pat. 10 cand. not in DB at all st 1 scan: 8 cand. 6 length1 seq. a b c d e f g h pat. Seq. ID Sequence 10 (bd)cb(ac) minsup =2 20 (bf)(ce)b(fg) 30 (ah)(bf)abf 40 (be)(ce)d 50 a(bd)bcb(ade) 17 July 20, 2017 www.ThesisScientist.comCandidate Generateandtest: Drawbacks  A huge set of candidate sequences generated.  Especially 2item candidate sequence.  Multiple Scans of database needed.  The length of each candidate grows by one at each database scan.  Inefficient for mining long sequential patterns.  A long pattern grow up from short patterns  The number of short patterns is exponential to the length of mined patterns. 18 July 20, 2017 www.ThesisScientist.comThe SPADE Algorithm  SPADE (Sequential PAttern Discovery using Equivalent Class) developed by Zaki 2001  A vertical format sequential pattern mining method  A sequence database is mapped to a large set of  Item: SID, EID  Sequential pattern mining is performed by  growing the subsequences (patterns) one item at a time by Apriori candidate generation 19 July 20, 2017 www.ThesisScientist.comThe SPADE Algorithm 20 July 20, 2017 www.ThesisScientist.comBottlenecks of GSP and SPADE  A huge set of candidates could be generated  1,000 frequent length1 sequences generate s huge number of 1000999 length2 candidates 100010001,499,500 2  Multiple scans of database in mining  Mining long sequential patterns  Needs an exponential number of short candidates 30  A length100 sequential pattern needs 10 candidate sequences 100 100  100 30  2110   i i1  21 July 20, 2017 www.ThesisScientist.comPrefix and Suffix (Projection)  a, aa, a(ab) and a(abc) are prefices of sequence a(abc)(ac)d(cf)  Given sequence a(abc)(ac)d(cf) Prefix Suffix (PrefixBased Projection) a (abc)(ac)d(cf) aa (bc)(ac)d(cf) a(ab) (c)(ac)d(cf) 22 July 20, 2017 www.ThesisScientist.comMining Sequential Patterns by Prefix Projections  Step 1: find length1 sequential patterns  a, b, c, d, e, f  Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:  The ones having prefix a;  The ones having prefix b; SID sequence 10 a(abc)(ac)d(cf)  … 20 (ad)c(bc)(ae)  The ones having prefix f 30 (ef)(ab)(df)cb 40 eg(af)cbc 23 July 20, 2017 www.ThesisScientist.comFinding Seq. Patterns with Prefix a  Only need to consider projections w.r.t. a  aprojected database: (abc)(ac)d(cf), (d)c(bc)(ae), (b)(df)cb, (f)cbc  Find all the length2 seq. pat. Having prefix a: aa, ab, (ab), ac, ad, af  Further partition into 6 subsets SID sequence 10 a(abc)(ac)d(cf)  Having prefix aa; 20 (ad)c(bc)(ae)  … 30 (ef)(ab)(df)cb 40 eg(af)cbc  Having prefix af 24 July 20, 2017 www.ThesisScientist.comCompleteness of PrefixSpan SDB SID sequence Length1 sequential patterns 10 a(abc)(ac)d(cf) 20 (ad)c(bc)(ae) a, b, c, d, e, f 30 (ef)(ab)(df)cb 40 eg(af)cbc Having prefix c, …, f Having prefix a Having prefix b aprojected database bprojected database … (abc)(ac)d(cf) Length2 sequential patterns (d)c(bc)(ae) aa, ab, (ab), (b)(df)cb ac, ad, af (f)cbc … … Having prefix aa Having prefix af aaproj. db afproj. db … 25 July 20, 2017 www.ThesisScientist.comEfficiency of PrefixSpan  No candidate sequence needs to be generated  Projected databases keep shrinking  Major cost of PrefixSpan: constructing projected databases  Can be improved by pseudoprojections 26 July 20, 2017 www.ThesisScientist.comSpeedup by Pseudoprojection  Major cost of PrefixSpan: projection  Postfixes of sequences often appear repeatedly in recursive projected databases  When (projected) database can be held in main memory, use pointers to form projections  Pointer to the sequence s=a(abc)(ac)d(cf)  Offset of the postfix a sa: ( , 2) (abc)(ac)d(cf) ab sab: ( , 4) (c)(ac)d(cf) 27 July 20, 2017 www.ThesisScientist.comPseudoProjection vs. Physical Projection  Pseudoprojection avoids physically copying postfixes  Efficient in running time and space when database can be held in main memory  However, it is not efficient when database cannot fit in main memory  Diskbased random accessing is very costly  Suggested Approach:  Integration of physical and pseudoprojection  Swapping to pseudoprojection when the data set fits in memory 28 July 20, 2017 www.ThesisScientist.comPerformance on Data Set C10T8S8I8 29 July 20, 2017 www.ThesisScientist.comCloSpan: Mining Closed Sequential Patterns  A closed sequential pattern s: there exists no superpattern s’ such that s’ כ s, and s’ and s have the same support  Motivation: reduces the number of (redundant) patterns but attains the same expressive power  Using Backward Subpattern and Backward Superpattern pruning to prune redundant search space 30 July 20, 2017 www.ThesisScientist.comConstraintBased Seq.Pattern Mining  Constraintbased sequential pattern mining  Constraints: Userspecified, for focused mining of desired patterns  How to explore efficient mining with constraints — Optimization  Classification of constraints  Antimonotone: E.g., valuesum(S) 150, min(S) 10  Monotone: E.g., count (S) 5, S  PC, digitalcamera  Succinct: E.g., length(S)  10, S  Pentium, MS/Office, MS/Money  Convertible: E.g., valueavg(S) 25, profitsum (S) 160, max(S)/avg(S) 2, median(S) – min(S) 5  Inconvertible: E.g., avg(S) – median(S) = 0 31 July 20, 2017 www.ThesisScientist.comFrom Sequential Patterns to Structured Patterns  Sets, sequences, trees, graphs, and other structures  Transaction DB: Sets of items  i , i , …, i , … 1 2 m  Seq. DB: Sequences of sets:  i , i , …, i , i , i , … 1 2 m n k  Sets of Sequences:  i , i , …, i , i , i , … 1 2 m n k  Sets of trees: t , t , …, t 1 2 n  Sets of graphs (mining for frequent subgraphs):  g , g , …, g 1 2 n  Mining structured patterns in XML documents, bio chemical structures, etc. 32 July 20, 2017 www.ThesisScientist.comEpisodes and Episode Pattern Mining  Other methods for specifying the kinds of patterns  Serial episodes: A  B  Parallel episodes: A B  Regular expressions: (A B)C(D  E)  Methods for episode pattern mining  Variations of Apriorilike algorithms, e.g., GSP  Database projectionbased pattern growth  Similar to the frequent pattern growth without candidate generation 33 July 20, 2017 www.ThesisScientist.comPeriodicity Analysis  Periodicity is everywhere: tides, seasons, daily power consumption, etc.  Full periodicity  Every point in time contributes (precisely or approximately) to the periodicity  Partial periodicit: A more general notion  Only some segments contribute to the periodicity  Jim reads NY Times 7:007:30 am every week day  Cyclic association rules  Associations which form cycles  Methods  Full periodicity: FFT, other statistical analysis methods  Partial and cyclic periodicity: Variations of Apriorilike mining methods 34 July 20, 2017 www.ThesisScientist.comRef: Mining Sequential Patterns  R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT’96.  H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97.  M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001.  J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by PrefixProjected Pattern Growth. ICDE'01 (TKDE’04).  J. Pei, J. Han and W. Wang, ConstraintBased Sequential Pattern Mining in Large Databases, CIKM'02.  X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.  J. Wang and J. Han, BIDE: Efficient Mining of Frequent Closed Sequences, ICDE'04.  H. Cheng, X. Yan, and J. Han, IncSpan: Incremental Mining of Sequential Patterns in Large Database, KDD'04.  J. Han, G. Dong and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series Database, ICDE'99.  J. Yang, W. Wang, and P. S. Yu, Mining asynchronous periodic patterns in time series data, KDD'00. 35 July 20, 2017 www.ThesisScientist.com
Website URL
Comment