Question? Leave a message!




SUBSTRING SEARCH

SUBSTRING SEARCH
ROBERT SEDGEWICK KEVIN WAYNE Algorithms 5.3 SUBSTRING SEARCH introduction ‣ brute force ‣ KnuthMorrisPratt ‣ Algorithms FOUR TH EDITION BoyerMoore ‣ RabinKarp ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu5.3 SUBSTRING SEARCH introduction ‣ brute force ‣ KnuthMorrisPratt ‣ Algorithms BoyerMoore ‣ RabinKarp ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduSubstring search Goal. Find pattern of length M in a text of length N. typically N M pattern N E E D L E text I N A H A Y S T A C K N E E D L E I N A match Substring search 3Substring search applications Goal. Find pattern of length M in a text of length N. typically N M pattern N E E D L E text I N A H A Y S T A C K N E E D L E I N A match Substring search 4Substring search applications Goal. Find pattern of length M in a text of length N. typically N M pattern N E E D L E text I N A H A Y S T A C K N E E D L E I N A match Substring search Computer forensics.  Search memory or disk for signatures, e.g., all URLs or RSA keys that the user has entered. http://citp.princeton.edu/memory 5Substring search applications Goal. Find pattern of length M in a text of length N. typically N M pattern N E E D L E text I N A H A Y S T A C K N E E D L E I N A match Substring search Identify patterns indicative of spam. PROFITS L0SE WE1GHT herbal Viagra There is no catch. This is a onetime mailing. This message is sent in compliance with spam regulations. 6Substring search applications Electronic surveillance. Need to monitor all internet traffic. (security) No way (privacy) Well, we’re mainly interested in “ATTACK AT DAWN” OK. Build a machine that just looks for that. “ATTACK AT DAWN” substring search machine found 7Substring search applications Screen scraping. Extract relevant data from web page. Ex. Find string delimited by b and /b after first occurrence of pattern Last Trade:. ... tr td class= "yfnctablehead1" width= "48" Last Trade: /td td class= "yfnctabledata1" bigb452.92/b/big /td/tr td class= "yfnctablehead1" http://finance.yahoo.com/qs=goog width= "48" Trade Time: /td td class= "yfnctabledata1" ... 8Screen scraping: Java implementation Java library. The indexOf() method in Java's string library returns the index of the first occurrence of a given string, starting at a given offset. public class StockQuote public static void main(String args) String name = "http://finance.yahoo.com/qs="; In in = new In(name + args0); String text = in.readAll(); int start = text.indexOf("Last Trade:", 0); int from = text.indexOf("b", start); int to = text.indexOf("/b", from); String price = text.substring(from + 3, to); StdOut.println(price); java StockQuote goog 582.93 java StockQuote msft 24.84 95.3 SUBSTRING SEARCH introduction ‣ brute force ‣ KnuthMorrisPratt ‣ Algorithms BoyerMoore ‣ RabinKarp ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduBruteforce substring search Check for pattern starting at each text position. i j i+j 0 1 2 3 4 5 6 7 8 9 10 A B A C A D A B R A C txt pat 0 2 2 A B R A entries in red are 1 0 1 A B R A mismatches 2 1 3 A B R A entries in gray are 3 0 3 A B R A for reference only 4 1 5 A B R A entries in black 5 0 5 A B R A match the text 6 4 10 A B R A return i when j is M match Bruteforce substring search 11i j i+j 0 1 2 3 4 5 6 7 8 9 10 A B A C A D A B R A C 4 3 7 A D A C R 5 0 5 A D A C R Bruteforce substring search: Java implementation Check for pattern starting at each text position. public static int search(String pat, String txt) int M = pat.length(); int N = txt.length(); for (int i = 0; i = N M; i++) int j; for (j = 0; j M; j++) if (txt.charAt(i+j) = pat.charAt(j)) break; index in text where if (j == M) return i; pattern starts return N; not found 12Bruteforce substring search: worst case Bruteforce algorithm can be slow if text and pattern are repetitive. i j i+j 0 1 2 3 4 5 6 7 8 9 10 i j i+j 0 1 2 3 4 5 6 7 8 9 txt A B A C A D A B R A C txt A A A A A A A A A B 0 2 2 A B R A pat pat 0 4 4 A A A A B entries in red are 1 0 1 A B R A 1 4 5 A A A A B mismatches 2 1 3 A B R A 2 4 6 A A A A B entries in gray are 3 0 3 A B R A for reference only 3 4 7 A A A A B 4 1 5 A B R A entries in black 4 4 8 A A A A B 5 0 5 A B R A match the text 5 5 10 A A A A B 6 4 10 A B R A return i when j is M Bruteforce substring search (worst case) match Bruteforce substring search Worst case. M N char compares. 13A A A A A A A A A A A A A A A A A A A A A B A A A A A B A A A A A A A A A A A A A A A A A A A A A B A A A A A B shift pattern right one position Backup In many applications, we want to avoid backup in text stream. Treat input as stream of data. “ATTACK AT DAWN” substring search Abstract model: standard input. machine found Bruteforce algorithm needs backup for every mismatch. matched chars mismatch backup Approach 1. Maintain buffer of last M characters. Approach 2. Stay tuned. 14i j 0 1 2 3 4 5 6 7 8 9 10 A B A C A D A B R A C 7 3 A D A C R 5 0 A D A C R Bruteforce substring search: alternate implementation Same sequence of char compares as previous implementation. i points to end of sequence of alreadymatched chars in text. j stores of alreadymatched chars (end of sequence in pattern). public static int search(String pat, String txt) int i, N = txt.length(); int j, M = pat.length(); for (i = 0, j = 0; i N j M; i++) if (txt.charAt(i) == pat.charAt(j)) j++; explicit backup else i = j; j = 0; if (j == M) return i M; else return N; 15Algorithmic challenges in substring search Bruteforce is not always good enough. fundamental algorithmic problem Theoretical challenge. Lineartime guarantee. often no room or time to save text Practical challenge. Avoid backup in text stream. Now is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for each good person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party. Now is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their attack at dawn party. Now is the time for each person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party. 165.3 SUBSTRING SEARCH introduction ‣ brute force ‣ KnuthMorrisPratt ‣ Algorithms BoyerMoore ‣ RabinKarp ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduKnuthMorrisPratt substring search Intuition. Suppose we are searching in text for pattern BAAAAAAAAA. th Suppose we match 5 chars in pattern, with mismatch on 6 char. We know previous 6 chars in text are BAAAAB. Don't need to back up text pointer assuming A, B alphabet ii te tex xt t A B A A A A B A A A A A A A A A A B A A A A B A A A A A A A A A aft afte er mismat r mismatc ch h p patt atte er rn n B A A A A B A A A A AA A A A A A A A A on six on sixth c th char har B B A A A A A A A A A A A A A A A A A A br brut utef efor orc ce b e bac acks ks up t up to t o tr ry this y this B B A A A A A A A A A A A A A A A A A A and this and this B B A A A A A A A A A A A A A A A A A A and this and this B B A A A A A A A A A A A A A A A A A A and this and this B A A B A A A A A A A A A A A A A A A A and this and this B B A A A A A A A A A A A A A A A A A A b but no b ut no bac ackup kup is nee is neede ded d T Te ex xt poin t point ter backup in substring sear er backup in substring searching ching KnuthMorrisPratt algorithm. Clever method to always avoid backup. () 18Deterministic finite state automaton (DFA) DFA is abstract stringsearching machine. Finite number of states (including start and halt). Exactly one transition for each char in alphabet. Accept if sequence of transitions leads to halt state. X internal representation X j 0 1 2 3 4 5 j 0 1 2 3 4 5 pat.charAt(j) A B A B A C If in state j reading char c: pat.charAt(j) A B A B A C A 1 1 3 1 5 1 if j is 6 halt and accept A 1 1 3 1 5 1 dfaj 0 2 0 4 0 4 B dfaj 0 2 0 4 0 4 B else move to state dfacj 0 0 0 0 0 6 C 0 0 0 0 0 6 C graphical representation A A A A B,C B,C A A B B 0 A B A B A C 1 5 6 2 3 4 A B A A C 0 B 1 5 6 2 3 4 C B,C C C B,C B,C C B,C Constructing the DFA for KMP substring search for A B A B A C 19 Constructing the DFA for KMP substring search for A B A B A CKnuthMorrisPratt demo: DFA simulation A A B A C A A B A B A C A A 0 1 2 3 4 5 pat.charAt(j) A B A B A C A 1 1 3 1 5 1 dfaj B 0 2 0 4 0 4 C 0 0 0 0 0 6 A A B, C A B 0 A 1 B 2 A 3 B 4 A 5 C 6 C B, C C B, C 20KnuthMorrisPratt demo: DFA simulation A A B A C A A B A B A C A A 0 1 2 3 4 5 pat.charAt(j) A B A B A C A 1 1 3 1 5 1 dfaj B 0 2 0 4 0 4 C 0 0 0 0 0 6 A A B, C A B 0 A 1 B 2 A 3 B 4 A 5 C 6 6 C B, C substring found C B, C 21Interpretation of KnuthMorrisPratt DFA Q. What is interpretation of DFA state after reading in txti A. State = number of characters in pattern that have been matched. length of longest prefix of pat that is a suffix of txt0..i Ex. DFA is in state 3 after reading in txt0..6. i 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 txt pat B C B A A B A C A A B A B A C suffix of txt0..6 prefix of pat A A B, C A B 0 A 1 B 2 A 3 B 4 A 5 C 6 C B, C C B, C 22KnuthMorrisPratt substring search: Java implementation Key differences from bruteforce implementation. Need to precompute dfa from pattern. Text pointer i never decrements. public int search(String txt) int i, j, N = txt.length(); for (i = 0, j = 0; i N j M; i++) no backup j = dfatxt.charAt(i)j; if (j == M) return i M; else return N; Running time. Simulate DFA on text: at most N character accesses. Build DFA: how to do efficiently warning: tricky algorithm ahead 23KnuthMorrisPratt substring search: Java implementation Key differences from bruteforce implementation. Need to precompute dfa from pattern. Text pointer i never decrements. Could use input stream. public int search(In in) int i, j; for (i = 0, j = 0; in.isEmpty() j M; i++) no backup j = dfain.readChar()j; if (j == M) return i M; else return NOTFOUND; X 0 1 2 3 4 5 j pat.charAt(j) A B A B A C A 1 1 3 1 5 1 dfaj B 0 2 0 4 0 4 0 0 0 0 0 6 C A A B,C A B 0 A 1 B A B A 5 C 6 2 3 4 C B,C C B,C Constructing the DFA for KMP substring search for A B A B A C 24KnuthMorrisPratt demo: DFA construction Include one state for each character in pattern (plus accept state). 0 1 2 3 4 5 pat.charAt(j) A B A B A C A dfaj B C Constructing the DFA for KMP substring search for A B A B A C 0 1 2 3 4 5 6 25KnuthMorrisPratt demo: DFA construction 0 1 2 3 4 5 pat.charAt(j) A B A B A C A 1 1 3 1 5 1 dfaj B 0 2 0 4 0 4 C 0 0 0 0 0 6 Constructing the DFA for KMP substring search for A B A B A C A A B, C A B 0 A 1 B 2 A 3 B 4 A 5 C 6 C B, C C B, C 26How to build DFA from pattern Include one state for each character in pattern (plus accept state). 0 1 2 3 4 5 pat.charAt(j) A B A B A C A dfaj B C 0 1 2 3 4 5 6 27How to build DFA from pattern Match transition. If in state j and next char c == pat.charAt(j), go to j+1. first j characters of pattern next char matches now first j +1 characters of have already been matched pattern have been matched 0 1 2 3 4 5 pat.charAt(j) A B A B A C A 1 3 5 dfaj B 2 4 C 6 0 A 1 B 2 A 3 B 4 A 5 C 6 28How to build DFA from pattern Mismatch transition. If in state j and next char c = pat.charAt(j), then the last j1 characters of input are pat1..j1, followed by c. To compute dfacj: Simulate pat1..j1 on DFA and take transition c. still under construction () Running time. Seems to require j steps. Ex. dfa'A'5 = 1; dfa'B'5 = 4 simulate BABA; simulate BABA; j 0 1 2 3 4 5 take transition 'A' take transition 'B' pat.charAt(j) A B A B A C = dfa'A'3 = dfa'B'3 A simulation of BABA j A B, C A A B 0 A 1 B 2 A 3 3 B B 4 A 5 C 6 C B, C C B, C 29How to build DFA from pattern Mismatch transition. If in state j and next char c = pat.charAt(j), then the last j1 characters of input are pat1..j1, followed by c. state X To compute dfacj: Simulate pat1..j1 on DFA and take transition c. Running time. Takes only constant time if we maintain state X. Ex. dfa'A'5 = 1; dfa'B'5 = 4 X' = 0 from state X, from state X, from state X, 0 1 2 3 4 5 take transition 'A' take transition 'B' take transition 'C' A B A B A C = dfa'A'X = dfa'B'X = dfa'C'X A X j A B, C A A B 0 A 1 B 2 A 3 B B 4 A 5 C 6 C B, C C B, C 30KnuthMorrisPratt demo: DFA construction in linear time Include one state for each character in pattern (plus accept state). 0 1 2 3 4 5 pat.charAt(j) A B A B A C A dfaj B C Constructing the DFA for KMP substring search for A B A B A C 0 1 2 3 4 5 6 31KnuthMorrisPratt demo: DFA construction in linear time 0 1 2 3 4 5 pat.charAt(j) A B A B A C A 1 1 3 1 5 1 dfaj B 0 2 0 4 0 4 C 0 0 0 0 0 6 Constructing the DFA for KMP substring search for A B A B A C A A B, C A B 0 A 1 B 2 A 3 B 4 A 5 C 6 C B, C C B, C 32Constructing the DFA for KMP substring search: Java implementation For each state j: Copy dfaX to dfaj for mismatch case. Set dfapat.charAt(j)j to j+1 for match case. Update X. public KMP(String pat) this.pat = pat; M = pat.length(); dfa = new intRM; dfapat.charAt(0)0 = 1; for (int X = 0, j = 1; j M; j++) for (int c = 0; c R; c++) copy mismatch cases dfacj = dfacX; set match case dfapat.charAt(j)j = j+1; update restart state X = dfapat.charAt(j)X; Running time. M character accesses (but space/time proportional to R M). 33KMP substring search analysis Proposition. KMP substring search accesses no more than M + N chars to search for a pattern of length M in a text of length N. Pf. Each pattern char accessed once when constructing the DFA; each text char accessed once (in the worst case) when simulating the DFA. internal representation j 0 1 2 3 4 5 pat.charAt(j) A B A B A C Proposition. KMP constructs dfa in time and space proportional to R M. nextj 0 0 0 0 0 3 Larger alphabets. Improved version of KMP constructs nfa in time and mismatch transition (back up at least one state) space proportional to M. graphical representation 0 A 1 B A B A C 2 3 4 5 6 KMP NFA for ABABAC NFA corresponding to the string A B A B A C 34KnuthMorrisPratt: brief history Independently discovered by two theoreticians and a hacker. – Knuth: inspired by esoteric theorem, discovered linear algorithm – Pratt: made running time independent of alphabet size – Morris: built a text editor for the CDC 6400 computer Theory meets practice. SIAM J. COMPUT. Vol. No. 2, June 1977 6, MATCHING IN STRINGS FAST PATTERN AND VAUGHAN R. JAMES H. MORRIS, JR.:l: PRATT DONALD E. KNUTHf, of one. within is which finds all occurrences given string Abstract. An algorithm presented The constant of running time proportional to the sum of the lengths of the strings. another, in can also be to make this of practical use, and the procedure is low enough algorithm proportionality ofthe withsomemore problems.A theoretical application extended to deal generalpatternmatching can, canbe the set of concatenations of even palindromes, i.e., the language algorithmshows that onthe are alsoconsidered. lineartime. Other whichrunevenfaster average recognized in algorithms words, trie searching, period of a textediting, patternmatching, memory, Key pattern, string, optimum algorithm, Fibonacci string, regular expression string, palindrome, are often to search through a of Textediting programs required string characters looking for instances of a given "pattern" string; we wish to find all or the leftmost which the as a positions, perhaps only position, in pattern occurs of the text. For c a e n a r contains the contiguous substring example, y pattern e butwe do not regard c a n a ry as a substring. n, Jim Morris Vaughan Pratt Don Knuth The obviousway to search for a matching pattern is to try searching at every 35 of the the as soon as an incorrect starting position text, abandoning search character is found. But this canbe inefficient, forexamplewhenwe approach very are looking for an occurrence of aaaaaaab in aaaaaaaaaaaaaab. When the pattern isa"b and the text is a2"b,we willfindourselvesmaking (n+ 1) of characters. the traditional involves comparisons Furthermore, approach "backing up" the input text as we go through it, and this can add annoying complications when we consider the buffering operations that are frequently involved. In this paper we describe a which finds all patternmatching algorithm occurrences of a pattern of length rn within a text of length n in O(rn+n) units of without the text. The needs time, "backing up" input algorithm only O(m) locations of internal if the text is read from an external and only memory file, O(logm) units of time elapsebetween consecutive singlecharacter inputs. All of the constants of proportionality implied by these "O" formulas are independent of the size. alphabet Received by the editors August and 29, 1974, in revised form April 7, 1976. t Computer Science Department, Stanford California University, Stanford, 94305. The work of this authorwas supported inpartbythe National ScienceFoundationunderGrantGJ36473Xand by the Office of Naval Research under ContractNR 044402. Xerox Palo Alto Research Center, Palo California Alto, 94304. The work of this author was supported in part by the National Science Foundation under Grant GP 7635 at the University of California, Berkeley. Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Mas sachusetts 02139. The work of this authorwas supported in part by the National Science Foundation under Grant GP6945 at University of and California, Berkeley, under Grant GJ992 at Stanford University. 3235.3 SUBSTRING SEARCH introduction ‣ brute force ‣ KnuthMorrisPratt ‣ Algorithms BoyerMoore ‣ RabinKarp ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu Robert Boyer J. Strother MooreBoyerMoore: mismatched character heuristic Intuition. Scan characters in pattern from right to left. Can skip as many as M text chars when finding one not in the pattern. i j 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 text F I N D I N A H A Y S T A C K N E E D L E I N A pattern 0 5 N E E D L E 5 5 N E E D L E 11 4 N E E D L E 15 0 N E E D L E return i = 15 Mismatched character heuristic for righttoleft (BoyerMoore) substring search 37BoyerMoore: mismatched character heuristic Q. How much to skip Case 1. Mismatch character not in pattern. i before . . . . . . T L E . . . . . . txt pat N E E D L E i after . . . . . . T L E . . . . . . txt pat N E E D L E mismatch character 'T' not in pattern: increment i one character beyond 'T' 38BoyerMoore: mismatched character heuristic Q. How much to skip Case 2a. Mismatch character in pattern. i before . . . . . . N L E . . . . . . txt pat N E E D L E i after . . . . . . N L E . . . . . . txt pat N E E D L E mismatch character 'N' in pattern: align text 'N' with rightmost pattern 'N' 39BoyerMoore: mismatched character heuristic Q. How much to skip Case 2b. Mismatch character in pattern (but heuristic no help). i before . . . . . . E L E . . . . . . txt pat N E E D L E i aligned with rightmost E . . . . . . E L E . . . . . . txt pat N E E D L E mismatch character 'E' in pattern: align text 'E' with rightmost pattern 'E' 40BoyerMoore: mismatched character heuristic Q. How much to skip Case 2b. Mismatch character in pattern (but heuristic no help). i before . . . . . . E L E . . . . . . txt pat N E E D L E i after . . . . . . E L E . . . . . . txt pat N E E D L E mismatch character 'E' in pattern: increment i by 1 41BoyerMoore: mismatched character heuristic Q. How much to skip A. Precompute index of rightmost occurrence of character c in pattern. (1 if character not in pattern) N E E D L E c rightc 0 1 2 3 4 5 A 1 1 1 1 1 1 1 1 B 1 1 1 1 1 1 1 1 right = new intR; C 1 1 1 1 1 1 1 1 for (int c = 0; c R; c++) D 1 1 1 1 3 3 3 3 rightc = 1; for (int j = 0; j M; j++) E 1 1 1 2 2 2 5 5 rightpat.charAt(j) = j; ... 1 L 1 1 1 1 1 4 4 4 M 1 1 1 1 1 1 1 1 N 1 0 0 0 0 0 0 0 ... 1 BoyerMoore skip table computation 42BoyerMoore: Java implementation public int search(String txt) int N = txt.length(); int M = pat.length(); int skip; for (int i = 0; i = NM; i += skip) skip = 0; for (int j = M1; j = 0; j) compute skip value if (pat.charAt(j) = txt.charAt(i+j)) skip = Math.max(1, j righttxt.charAt(i+j)); break; in case other term is nonpositive match if (skip == 0) return i; return N; 43BoyerMoore: analysis Property. Substring search with the BoyerMoore mismatched character heuristic takes about N / M character compares to search for a pattern of length M in a text of length N. sublinear Worstcase. Can be as bad as M N. i skip 0 1 2 3 4 5 6 7 8 9 txt B B B B B B B B B B pat 0 0 A B B B B 1 1 A B B B B 2 1 A B B B B 3 1 A B B B B 4 1 A B B B B 5 1 A B B B B BoyerMooreHorspool substring search (worst case) BoyerMoore variant. Can improve worst case to 3 N character compares by adding a KMPlike rule to guard against repetitive patterns. 445.3 SUBSTRING SEARCH introduction ‣ brute force ‣ KnuthMorrisPratt ‣ Algorithms BoyerMoore ‣ RabinKarp ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu Michael Rabin Dick KarpRabinKarp fingerprint search Basic idea = modular hashing. Compute a hash of pat0..M1. For each i, compute a hash of txti..M+i1. If pattern hash = text substring hash, check for a match. pat.charAt(i) pat.charAt(i) i 0 1 2 3 4 i 0 1 2 3 4 2 6 5 3 5 2 6 5 3 5 997 = 613 997 = 613 txt.charAt(i) txt.charAt(i) i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 3 1 4 1 5 9 3 1 4 1 5 9 2 6 5 3 5 2 6 5 3 5 8 9 7 9 3 8 9 7 9 3 0 3 1 4 1 5 0 3 1 4 1 5 997 = 508 997 = 508 1 1 4 1 5 9 1 1 4 1 5 9 997 = 201 997 = 201 2 4 1 5 9 2 2 4 1 5 9 2 997 = 715 997 = 715 3 1 5 9 2 6 3 1 5 9 2 6 997 = 971 997 = 971 4 5 9 2 6 5 4 5 9 2 6 5 997 = 442 997 = 442 mat matc ch h 5 9 2 6 5 3 5 9 2 6 5 3 997 = 929 997 = 929 r re et tur urn n i = 6 i = 6 66 2 6 5 3 5 2 6 5 3 5 997 = 997 = 613 613 B Basis f asis for R or RabinK abinKarp substring sear arp substring search ch modular hashing with R = 10 and hash(s) = s (mod 997) 46Modular arithmetic Math trick. To keep numbers small, take intermediate results modulo Q. Ex. (10000 + 535) 1000 (mod 997) = (30 + 535) 3 (mod 997) = 1695 (mod 997) = 698 (mod 997) (a + b) mod Q = ((a mod Q) + (b mod Q)) mod Q (a b) mod Q = ((a mod Q) (b mod Q)) mod Q two useful modular arithmetic identities 47Efficiently computing the hash function Modular hash function. Using the notation t for txt.charAt(i), i we wish to compute M–1 M–2 0 x = t R + t R + … + t R (mod Q) i i i+1 i+M–1 Intuition. Mdigit, baseR integer, modulo Q. Horner's method. Lineartime method to evaluate degreeM polynomial. pat.charAt() // Compute hash for Mdigit key i 0 1 2 3 4 private long hash(String key, int M) 2 6 5 3 5 R Q 0 2 997 = 2 long h = 0; 1 2 6 997 = (210 + 6) 997 = 26 for (int j = 0; j M; j++) 2 2 6 5 997 = (2610 + 5) 997 = 265 h = (h R + key.charAt(j)) Q; 3 2 6 5 3 997 = (26510 + 3) 997 = 659 return h; 4 2 6 5 3 5 997 = (65910 + 5) 997 = 613 Computing the hash value for the pattern with Horner’s method 26535 = 210000 + 61000 + 5100 + 310 + 5 = ((((2) 10 + 6) 10 + 5) 10 + 3) 10 + 5 48Efficiently computing the hash function Challenge. How to efficiently compute x given that we know x . i+1 i M–1 M–2 0 x = t R + t R + … + t R i i i+1 i+M–1 M–1 M–2 0 x = t R + t R + … + t R i+1 i+1 i+2 i+M Key property. Can update "rolling" hash function in constant time M–1 x = ( x – t R ) R + t i+1 i i i + M current subtract multiply add new M1 (can precompute R ) value leading digit by radix trailing digit i ... 2 3 4 5 6 7 ... i ... 2 3 4 5 6 7 ... cur curr re ent v nt value alue 11 4 1 5 9 2 4 1 5 9 2 6 5 6 5 t te ex xt t ne new v w value alue 44 1 5 9 2 6 1 5 9 2 6 55 cur curr re ent v nt value alue 4 1 5 9 2 4 1 5 9 2 44 0 0 0 0 0 0 0 0 s sub ubt tr rac act leading dig t leading digit it 1 5 9 2 1 5 9 2 m mult ultip ipl ly b y by r y radix adix 1 0 1 0 1 5 9 2 0 1 5 9 2 0 add ne add new t w tr railing dig ailing digit it + 6 + 6 ne new v w value alue 1 5 9 2 1 5 9 2 66 49 K Ke ey c y computa omputation in R tion in RabinK abinKarp substring sear arp substring search ch (mo (mov ve righ e right one position in the t t one position in the te ex xt) t)RabinKarp substring search example First R entries: Use Horner's rule. Remaining entries: Use rolling hash (and to avoid overflow). i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 3 1 4 1 5 9 3 1 4 1 5 9 3 1 4 1 5 9 2 6 5 3 5 2 6 5 3 5 2 6 5 3 5 8 9 7 9 3 8 9 7 9 3 8 9 7 9 3 QQQ 0 3 0 3 0 3 997 = 997 = 997 = 333 1 3 1 1 3 1 1 3 1 997 = (310 + 997 = (310 + 997 = (310 + 111) 997 = 31 ) 997 = 31 ) 997 = 31 Horner's 2 3 1 4 2 3 1 4 2 3 1 4 997 = (3110 + 997 = (3110 + 997 = (3110 + 444) 997 = 314 ) 997 = 314 ) 997 = 314 rule 3 3 1 4 1 3 3 1 4 1 3 3 1 4 1 997 = (31410 + 997 = (31410 + 997 = (31410 + 111) 997 = 150 ) 997 = 150 ) 997 = 150 RM RM RM RRR 4 3 1 4 1 5 4 3 1 4 1 5 4 3 1 4 1 5 997 = (15010 + 997 = (15010 + 997 = (15010 + 555) 997 = 508 ) 997 = 508 ) 997 = 508 5 1 4 1 5 9 5 1 4 1 5 9 5 1 4 1 5 9 997 = ((508 + 997 = ((508 + 997 = ((508 + 333(997 30))10 + (997 30))10 + (997 30))10 + 999) 997 = 201 ) 997 = 201 ) 997 = 201 6 4 1 5 9 2 6 4 1 5 9 2 6 4 1 5 9 2 997 = ((201 + 997 = ((201 + 997 = ((201 + 111(997 30))10 + (997 30))10 + (997 30))10 + 222) 997 = 715 ) 997 = 715 ) 997 = 715 rolling 7 1 5 9 2 6 7 1 5 9 2 6 7 1 5 9 2 6 997 = ((715 + 997 = ((715 + 997 = ((715 + 444(997 30))10 + (997 30))10 + (997 30))10 + 666) 997 = 971 ) 997 = 971 ) 997 = 971 hash 8 5 9 2 6 5 8 5 9 2 6 5 8 5 9 2 6 5 997 = ((971 + 997 = ((971 + 997 = ((971 + 111(997 30))10 + (997 30))10 + (997 30))10 + 555) 997 = 442 ) 997 = 442 ) 997 = 442 mat mat matc c ch h h 9 9 2 6 5 3 9 9 2 6 5 3 9 9 2 6 5 3 997 = ((442 + 997 = ((442 + 997 = ((442 + 555(997 30))10 + (997 30))10 + (997 30))10 + 333) 997 = 929 ) 997 = 929 ) 997 = 929 10 10 10 r r re e et t tur ur urn n n iM+1 = 6 iM+1 = 6 iM+1 = 6 2 6 5 3 5 2 6 5 3 5 2 6 5 3 5 997 = ((929 + 997 = ((929 + 997 = ((929 + 999(997 30))10 + (997 30))10 + (997 30))10 + 555) 997 = ) 997 = ) 997 = 613 613 613 R R RabinK abinK abinKarp substring sear arp substring sear arp substring search e ch e ch example xample xample 30 (mod 997) = 997 – 30 10000 (mod 997) = 30 50RabinKarp: Java implementation public class RabinKarp private long patHash; // pattern hash value private int M; // pattern length private long Q; // modulus private int R; // radix private long RM1; // R(M1) Q public RabinKarp(String pat) M = pat.length(); R = 256; a large prime Q = longRandomPrime(); (but avoid overflow) M – 1 RM1 = 1; precompute R (mod Q) for (int i = 1; i = M1; i++) RM1 = (R RM1) Q; patHash = hash(pat, M); private long hash(String key, int M) / as before / public int search(String txt) / see next slide / 51RabinKarp: Java implementation (continued) Monte Carlo version. Return match if hash match. public int search(String txt) check for hash collision using rolling hash function int N = txt.length(); int txtHash = hash(txt, M); if (patHash == txtHash) return 0; for (int i = M; i N; i++) txtHash = (txtHash + Q RMtxt.charAt(iM) Q) Q; txtHash = (txtHashR + txt.charAt(i)) Q; if (patHash == txtHash) return i M + 1; return N; Las Vegas version. Check for substring match if hash match; continue search if false collision. 52RabinKarp analysis 2 Theory. If Q is a sufficiently large random prime (about M N ), then the probability of a false collision is about 1 / N. Practice. Choose Q to be a large prime (but not so large to cause overflow). Under reasonable assumptions, probability of a collision is about 1 / Q. Monte Carlo version. Always runs in linear time. Extremely likely to return correct answer (but not always). Las Vegas version. Always returns correct answer. Extremely likely to run in linear time (but worst case is M N). 53RabinKarp fingerprint search Advantages. Extends to 2d patterns. Extends to finding multiple patterns. Disadvantages. Arithmetic ops slower than char compares. Las Vegas version requires backup. Poor worstcase guarantee. Q. How would you extend RabinKarp to efficiently search for any one of P possible patterns in a text of length N 54 5.3 Substring Search 679 RabinKarp substring search is known as a fingerprint search because it uses a small amount of information to represent a (potentially very large) pattern. Then it looks 5.3 Substring Search 679 for this fingerprint (the hash value) in the text. The algorithm is efficient because the fingerprints can be efficiently computed and compared. RabinKarp substring search is known as a fingerprint search because it uses a small Summary The table at the bottom of the page summarizes the algorithms that we amount of information to represent a (potentially very large) pattern. Then it looks have discussed for substring search. As is often the case when we have several algo for this fingerprint (the hash value) in the text. The algorithm is efficient because the rithms for the same task, each of them has attractive features. Brute force search is easy fingerprints can be efficiently computed and compared. to implement and works well in typical cases (Java’s indexOf() method in String uses bruteforce search); KnuthMorrisPratt is guaranteed lineartime with no backup in Substring search cost summary Summary The table at the bottom of the page summarizes the algorithms that we the input; BoyerMoore is sublinear (by a factor of M) in typical situations; and Rabin have discussed for substring search. As is often the case when we have several algo Karp is linear. Each also has drawbacks: bruteforce might require time proportional rithms for the same task, each of them has attractive features. Brute force search is easy to MN; KnuthMorrisPratt and BoyerMoore use extra space; and RabinKarp has a Cost of searching for an Mcharacter pattern in an Ncharacter text. to implement and works well in typical cases (Java’s indexOf() method in String uses relatively long inner loop (several arithmetic operations, as opposed to character com bruteforce search); KnuthMorrisPratt is guaranteed lineartime with no backup in pares in the other methods. These characteristics are summarized in the table below. the input; BoyerMoore is sublinear (by a factor of M) in typical situations; and Rabin Karp is linear. Each also has drawbacks: bruteforce might require time proportional operation count backup extra algorithm version correct to MN; KnuthMorrisPratt and BoyerMoore use extra space; and RabinKarp has a in input space guarantee typical relatively long inner loop (several arithmetic operations, as opposed to character com pares in the other methods. These characteristics are summarized in the table below. brute force — yes yes M N 1.1 N 1 full DFA operation count backup extra no yes 2 N 1.1 N MR algorithm version correct (Algorithm 5.6 ) in input space guarantee typical KnuthMorrisPratt mismatch no yes M 3 N 1.1 N brute force — yes yes M N 1.1 N 1 transitions only full DFA full algorithm yes yes 3 N N / M R no yes 2 N 1.1 N MR (Algorithm 5.6 ) KnuthMorrisPratt BoyerMoore mismatched char mismatch heuristic only ye no s ye yess M M 3 N N N 1.1 / M N R transitions only (Algorithm 5.7 ) full algorithm yes yes 3 N N / M R Monte Carlo † no 1 7 N 7 N yes (Algorithm 5.8 ) † R BaobyinK erMaoropre mismatched char † † heuristic only yes yes M N N / M R Las Vegas yes 1 7 N 7 N no (Algorithm 5.7 ) † probabilisitic guarantee, with uniform hash function Monte Carlo † no 1 7 N 7 N yes Cost summary for substringsearch implementations (Algorithm 5.8 ) † RabinKarp 55 † † Las Vegas yes 1 7 N 7 N no † probabilisitic guarantee, with uniform hash function Cost summary for substringsearch implementations
Website URL
Comment