Question? Leave a message!




R-way tries

R-way tries
ROBERT SEDGEWICK KEVIN WAYNE Algorithms 5.2 TRIES Rway tries ‣ ternary search tries ‣ characterbased operations ‣ Algorithms FOUR TH EDITION ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduSummary of the performance of symboltable implementations Order of growth of the frequency of operations. typical case typical case typical case or order dered ed operations operations implementation implementation operations operations on keys on keys search insert delete compareTo() ✔ redblack BST log N log N log N equals() † † † hash table 1 1 1 hashCode() † under uniform hashing assumption use array accesses to make Rway decisions (instead of binary decisions) Q. Can we do better A. Yes, if we can avoid examining the entire key, as with string sorting. 2String symbol table basic API String symbol table. Symbol table specialized to string keys. public class public class String StringSTValue STValue String StringST() ST() create an empty symbol table void put( put(String String key, Value val) key, Value val) put keyvalue pair into the symbol table Value get( get(String String key) key) return value paired with given key void delete( delete(String String key) key) delete key and corresponding value ⋮⋮ Goal. Faster than hashing, more flexible than BSTs. 3String symbol table implementations cost summary character accesses (typical case) character accesses (typical case) character accesses (typical case) character accesses (typical case) dedup dedup search search space implementation insert moby.txt actors.txt hit miss (references) 2 2 2 redblack BST L + c lg N c lg N c lg N 4N 1.40 97.4 hashing L L L 4N to 16N 0.76 40.6 (linear probing) Parameters file size words distinct • N = number of strings moby.txt 1.2 MB 210 K 32 K • L = length of string actors.txt 82 MB 11.4 M 900 K • R = radix Challenge. Efficient performance for string keys. 45.2 TRIES Rway tries ‣ ternary search tries ‣ characterbased operations ‣ Algorithms ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduTries 6Tries Tries. from retrieval, but pronounced "try" Store characters in nodes (not keys). Each node has R children, one for each possible character. (for now, we do not draw null links) link to trie for all keys that start with s root link to trie for all keys that start with she b s t y e h h 4 key value by 4 o a l e e 0 5 6 sea 6 value for she in node sells 1 r l l corresponding to last key character she 0 shells 3 e s l 7 1 shore 7 the 5 s 3 7Search in a trie Follow links corresponding to each character in the key. Search hit: node where search ends has a nonnull value. Search miss: reach null link or node where search ends has null value. get("shells") b s s t y e h h h 4 o a l e e e 0 5 6 r l ll e s ll 7 1 return value associated s s with last key character 3 3 (return 3) 8Search in a trie Follow links corresponding to each character in the key. Search hit: node where search ends has a nonnull value. Search miss: reach null link or node where search ends has null value. get("she") b s s t y e h h h 4 o a l e e 0 0 e 5 6 search may terminated r l l at an intermediate node (return 0) e s l 7 1 s 3 9Search in a trie Follow links corresponding to each character in the key. Search hit: node where search ends has a nonnull value. Search miss: reach null link or node where search ends has null value. get("shell") b s s t y e h h h 4 o a l e e e 0 5 6 r l ll e s ll 7 1 no value associated with last key character s 3 (return null) 10Search in a trie Follow links corresponding to each character in the key. Search hit: node where search ends has a nonnull value. Search miss: reach null link or node where search ends has null value. get("shelter") b s s t y e h h h 4 o a l e e e 0 5 6 r l ll e s l 7 1 no link to t s 3 (return null) 11Insertion into a trie Follow links corresponding to each character in the key. Encounter a null link: create new node. Encounter the last character of the key: set value in that node. put("shore", 7) b s s t y e h h h 4 o a l e 0 e 5 6 r l l e s l 7 1 s 3 12Trie construction demo trieTrie construction demo trie b s t y e h h 4 o a l e e 0 5 6 r l l e 7 s 1 l s 3Trie representation: Java implementation Node. A value, plus references to R nodes. private static class Node use Object instead of Value since private Object value; no generic array creation in Java private Node next = new NodeR; characters are implicitly neither keys nor s defined by link index characters are s explicitly stored e h e h e a l a 0 e 2 2 0 l l each node has l an array of links s and a value 1 s 1 Trie representation 15Rway trie: Java implementation public class TrieSTValue extended ASCII private static final int R = 256; private Node root = new Node(); private static class Node / see previous slide / public void put(String key, Value val) root = put(root, key, val, 0); private Node put(Node x, String key, Value val, int d) if (x == null) x = new Node(); if (d == key.length()) x.val = val; return x; char c = key.charAt(d); x.nextc = put(x.nextc, key, val, d+1); return x; ⋮ 16Rway trie: Java implementation (continued) ⋮ public boolean contains(String key) return get(key) = null; public Value get(String key) Node x = get(root, key, 0); if (x == null) return null; cast needed return (Value) x.val; private Node get(Node x, String key, int d) if (x == null) return null; if (d == key.length()) return x; char c = key.charAt(d); return get(x.nextc, key, d+1); 17Trie performance Search hit. Need to examine all L characters for equality. Search miss. Could have mismatch on first character. Typical case: examine only a few characters (sublinear). Space. R null links at each leaf. (but sublinear space possible if many short strings share common prefixes) characters are implicitly s defined by link index s e h e h e a l a e 2 0 2 0 l l each node has l an array of links s and a value 1 s 1 Trie representation Bottom line. Fast search hit and even faster search miss, but wastes space. 18Deletion in an Rway trie To delete a keyvalue pair: Find the node corresponding to key and set value to null. If node has null value and all null links, remove that node (and recur). delete("shells") b s s t y e h h h 4 o a l e e e 0 5 6 r l ll e s l 7 1 l s s set value to null 3 19Deletion in an Rway trie To delete a keyvalue pair: Find the node corresponding to key and set value to null. If node has null value and all null links, remove that node (and recur). delete("shells") b s s t y e h h 4 o a l e e 0 5 6 r l l e s 7 1 l null value and links s (delete node) 20String symbol table implementations cost summary character accesses (typical case) character accesses (typical case) character accesses (typical case) character accesses (typical case) dedup dedup search search space implementation insert moby.txt actors.txt hit miss (references) 2 2 2 redblack BST L + c lg N c lg N c lg N 4N 1.40 97.4 hashing L L L 4N to 16N 0.76 40.6 (linear probing) out of Rway trie L log N L (R+1) N 1.12 R memory Rway trie. Method of choice for small R. Too much memory for large R. Challenge. Use less memory, e.g., 65,536way trie for Unicode 215.2 TRIES Rway tries ‣ ternary search tries ‣ characterbased operations ‣ Algorithms ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduTernary search tries Store characters and values in nodes (not keys). Each node has 3 children: smaller (left), equal (middle), larger (right). Fast Algorithms for Sorting and Searching Strings Jon L. Bentley Robert Sedgewick Abstract that is competitive with the most efficient string sorting programs known. The second program is a symbol table We present theoretical algorithms for sorting and implementation that is faster than hashing, which is com searching multikey data, and derive from them practical C monly regarded as the fastest symbol table implementa implementations for applications in which keys are charac tion. The symbol table implementation is much more ter strings. The sorting algorithm blends Quicksort and spaceefficient than multiway trees, and supports more radix sort; it is competitive with the best known C sort advanced searches. codes. The searching algorithm blends tries and binary search trees; it is faster than hashing and other commonly In many application programs, sorts use a Quicksort used search methods. The basic ideas behind the algo implementation based on an abstract compare operation, rithms date back at least to the 1960s but their practical and searches use hashing or binary search trees. These do utility has been overlooked. We also present extensions to not take advantage of the properties of string keys, which more complex string problems, such as partialmatch are widely used in practice. Our algorithms provide a nat searching. ural and elegant way to adapt classical algorithms to this important class of applications. 1. Introduction Section 6 turns to more difficult stringsearching prob lems. Partialmatch queries allow “don’t care” characters Section 2 briefly reviews Hoare’s 9 Quicksort and (the pattern “so.a”, for instance, matches soda and sofa). binary search trees. We emphasize a wellknown isomor phism relating the two, and summarize other basic facts. The primary result in this section is a ternary search tree implementation of Rivest’s partialmatch searching algo The multikey algorithms and data structures are pre rithm, and experiments on its performance. “Near neigh sented in Section 3. Multikey Quicksort orders a set of II bor” queries locate all words within a given Hamming dis vectors with k components each. Like regular Quicksort, it 23 tance of a query word (for instance, code is distance 2 partitions its input into sets less than and greater than a from soda). We give a new algorithm for near neighbor given value; like radix sort, it moves on to the next field searching in strings, present a simple C implementation, once the current input is known to be equal in the given and describe experiments on its efficiency. field. A node in a ternary search tree represents a subset of Conclusions are offered in Section 7. vectors with a partitioning value and three pointers: one to lesser elements and one to greater elements (as in a binary 2. Background search tree) and one to equal elements, which are then pro cessed on later fields (as in tries). Many of the structures Quicksort is a textbook divideandconquer algorithm. and analyses have appeared in previous work, but typically To sort an array, choose a partitioning element, permute as complex theoretical constructions, far removed from the elements such that lesser elements are on one side and practical applications. Our simple framework opens the greater elements are on the other, and then recursively sort door for later implementations. the two subarrays. But what happens to elements equal to the partitioning value Hoare’s partitioning method is The algorithms are analyzed in Section 4. Many of the binary: it places lesser elements on the left and greater ele analyses are simple derivations of old results. ments on the right, but equal elements may appear on Section 5 describes efficient C programs derived from either side. the algorithms. The first program is a sorting algorithm Algorithm designers have long recognized the desir Bell Labs, Lucent Technologies, 700 Mountam Avenue, Murray Hill. irbility and difficulty of a ternary partitioning method. NJ 07974; jlbresearch.belllabs.com. Sedgewick 22 observes on page 244: “Ideally, we would Princeton University. Princeron. NJ. 08514: rscs.princeton.edu. llke to get all equal keys1 into position in the file, with all 360 Ternary search tries Store characters and values in nodes (not keys). Each node has 3 children: smaller (left), equal (middle), larger (right). link to TST for all keys link to TST for all keys that start with that start with s a letter before s s t b a a t s b y 4 y 4 e r h h r h u h e u 10 14 e e r e e a 14 e o r e 12 0 8 12 10 0 8 l l a o l r e l r e l l each node has three links s 11 s 11 7 l e 7 l l e l s 15 y 13 s 15 y 13 TST representation of a trie 24Search hit in a TST get("sea") s s b t y h h h 4 e e 0 ll e e 5 o a a 6 6 l l r return value associated with last key character s 1 l e 7 s 3 25Search miss in a TST get("shelter") s s b t y h h h 4 e 0 l e e e 5 o a 6 l ll r s 1 ll e 7 no link to t s 3 (return null) 26Ternary search trie construction demo ternary search trie 27Ternary search trie construction demo ternary search trie s b t y h h 4 e 0 l e e 5 o a 6 l l r s 1 l e 7 s 3 28Search in a TST Follow links corresponding to each character in the key. If less, take left link; if greater, take right link. If equal, take the middle link and move to the next key character. Search hit. Node where search ends has a nonnull value. Search miss. Reach a null link or node where search ends has null value. get("sea") match: take middle link, move to next char mismatch: take left or right link, do not move to next char s t b a y 4 r h h e u 10 e e e r 12 8 l 14 a o l r e l return value s 11 7 l e l associated with last key character s 15 y 13 TST search example 2926way trie vs. TST now for 26way trie. 26 null links in each leaf. tip ilk dim tag jot sob nob sky hut ace bet 26way trie (1035 null links, not shown) men egg few jay owl TST. 3 null links in each leaf. joy rap gig wee was cab wad caw cue fee tap ago tar jam dug TST (155 null links) and 30TST representation in Java A TST node is five fields: private class Node A value. A character c. private Value val; private char c; A reference to a left TST. private Node left, mid, right; A reference to a middle TST. A reference to a right TST. ternary search tree (TST) standard array of links (R = 26) s link for keys s that start with s h e u e h u link for keys that start with su Trie node representations 31TST: Java implementation public class TSTValue private Node root; private class Node / see previous slide / public void put(String key, Value val) root = put(root, key, val, 0); private Node put(Node x, String key, Value val, int d) char c = key.charAt(d); if (x == null) x = new Node(); x.c = c; if (c x.c) x.left = put(x.left, key, val, d); else if (c x.c) x.right = put(x.right, key, val, d); else if (d key.length() 1) x.mid = put(x.mid, key, val, d+1); else x.val = val; return x; ⋮ 32TST: Java implementation (continued) ⋮ public boolean contains(String key) return get(key) = null; public Value get(String key) Node x = get(root, key, 0); if (x == null) return null; return x.val; private Node get(Node x, String key, int d) if (x == null) return null; char c = key.charAt(d); if (c x.c) return get(x.left, key, d); else if (c x.c) return get(x.right, key, d); else if (d key.length() 1) return get(x.mid, key, d+1); else return x; 33String symbol table implementation cost summary character accesses (typical case) character accesses (typical case) character accesses (typical case) character accesses (typical case) dedup dedup search search space implementation insert moby.txt actors.txt hit miss (references) 2 2 2 redblack BST L + c lg N c lg N c lg N 4N 1.40 97.4 hashing L L L 4N to 16N 0.76 40.6 (linear probing) out of Rway trie L log N L (R+1) N 1.12 R memory TST L + ln N ln N L + ln N 4 N 0.72 38.7 Remark. Can build balanced TSTs via rotations to achieve L + log N worstcase guarantees. Bottom line. TST is as fast as hashing (for string keys), space efficient. 342 TST with R branching at root Hybrid of Rway trie and TST. 2 Do R way branching at root. 2 Each of R root nodes points to a TST. 2 array of 26 roots aa ab ac zy zz … TST TST TST TST TST Q. What about one and twoletter words 35String symbol table implementation cost summary character accesses (typical case) character accesses (typical case) character accesses (typical case) character accesses (typical case) dedup dedup search search space implementation insert moby.txt actors.txt hit miss (references) 2 2 2 redblack BST L + c lg N c lg N c lg N 4N 1.40 97.4 hashing L L L 4N to 16N 0.76 40.6 (linear probing) out of Rway trie L log N L (R+1) N 1.12 R memory TST L + ln N ln N L + ln N 4 N 0.72 38.7 2 2 L + ln N ln N L + ln N 4 N + R 0.51 32.7 TST with R Bottom line. Faster than hashing for our benchmark client. 36TST vs. hashing Hashing. Need to examine entire key. Search hits and misses cost about the same. Performance relies on hash function. Does not support ordered symbol table operations. TSTs. Works only for string (or digital) keys. Only examines just enough key characters. Search miss may involve only a few characters. Supports ordered symbol table operations (plus extras). Bottom line. TSTs are: Faster than hashing (especially for search misses). More flexible than redblack BSTs. stay tuned 375.2 TRIES Rway tries ‣ ternary search tries ‣ characterbased operations ‣ Algorithms ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduString symbol table API Characterbased operations. The string symbol table API supports several useful characterbased operations. key value by 4 sea 6 sells 1 she 0 shells 3 shore 7 the 5 Prefix match. Keys with prefix sh: she, shells, and shore. Wildcard match. Keys that match .he: she and the. Longest prefix. Key that is the longest prefix of shellsort: shells. 39String symbol table API public class public class public class StringST StringST StringSTValue Value Value StringST() create a symbol table with string keys void put(String key, Value val) put keyvalue pair into the symbol table Value get(String key) value paired with key void delete(String key) delete key and corresponding value ⋮ IterableString keys() all keys IterableString keysWithPrefix(String s) keys having s as a prefix IterableString keysThatMatch(String s) keys that match s (where . is a wildcard) String longestPrefixOf(String s) longest key that is a prefix of s Remark. Can also add other ordered ST methods, e.g., floor() and rank(). 40Warmup: ordered iteration To iterate through all keys in sorted order: Do inorder traversal of trie; add keys encountered to a queue. Maintain sequence of characters on path from root to node. keys() keysWithPrefix(""); key q b s t b by by s y 4 e h h se a 6 e e sea by sea 0 5 l o sel sell l r l sells by sea sells s e 7 1 l sh she by sea sells she s 3 shell shells by sea sells she shells sho shor shore by sea sells she shells shore t th the by sea sells she shells shore the Collecting the keys in a trie (trace) 41Ordered iteration: Java implementation To iterate through all keys in sorted order: Do inorder traversal of trie; add keys encountered to a queue. Maintain sequence of characters on path from root to node. public IterableString keys() QueueString queue = new QueueString(); collect(root, "", queue); return queue; sequence of characters on path from root to x private void collect(Node x, String prefix, QueueString q) if (x == null) return; if (x.val = null) q.enqueue(prefix); for (char c = 0; c R; c++) collect(x.nextc, prefix + c, q); or use StringBuilder 42Prefix matches Find all keys in a symbol table starting with a given prefix. Ex. Autocomplete in a cell phone, search bar, text editor, or shell. User types characters one at a time. System reports all matching strings. 43Prefix matches in an Rway trie Find all keys in a symbol table starting with a given prefix. keysWithPrefix("sh"); keysWithPrefix("sh"); key q key q sh sh s s t t s s b b t t b b she she she she shel shel y 4 ye4 e y 4 ye4 e h h h h h h h h shell shell shellsshells she shells she shells a 6 a 6 e e e e 0 0 5 5 a 6 a 6e e e e l l o o 0 0 5 5 l l o o sho sho find subfind s trie for al ubtrlie for all shor shor l l l l r r l l l l r r shore shore she shells she shells shore shore keys begk inning w eys beginning w ith "sh" ith "sh" e e s 1 s 1 7 7 s 1 s 1e 7 e 7 l l l l collect kc eo ylslect keys in that sin that s ubtrie ubtrie s 3 s 3 s 3 s 3 Prefi x ma Pre tfich in a trie x match in a trie keysWithPrefix("sh"); public IterableString keysWithPrefix(String prefix) key queue key q sh s t s b t b she she QueueString queue = new QueueString(); shel y 4 e y 4 e h h h Node x = get(root, prefix, 0); h shell shells she shells collect(x, prefix, queue); a 6 e e 5 a 6 e 0 o e 5 l 0 l o sho root of subtrie for all strings find subtrie for all shor return queue; l l r l l r shore she shells shore beginning with given prefix keys beginning with "sh" s 1 e 7 s 1 e 7 l l collect keys in that subtrie 44 s 3 s 3 Prefi x match in a trieLongest prefix Find longest key in symbol table that is a prefix of query string. Ex. To send packet toward destination IP address, router chooses IP address in routing table that is longest prefix match. represented as 32bit "128" binary number for IPv4 "128.112" (instead of string) "128.112.055" "128.112.055.15" longestPrefixOf("128.112.136.11") = "128.112.136" "128.112.136" longestPrefixOf("128.112.100.16") = "128.112" "128.112.155.11" longestPrefixOf("128.166.123.45") = "128" "128.112.155.13" "128.222" "128.222.136" floor("128.112.100.16") = "128.112.055.15" Note. Not the same as floor: 45Longest prefix in an Rway trie Find longest key in symbol table that is a prefix of query string. Search for query string. Keep track of longest key encountered. "she" "she" "she" "shell" "shell" "shell" "shellsort" "shellsort" "shellsort" "shelters" "shelters" "shelters" s s s s s s s s s s s s e e e e e e h h h h h h e e e e e e h h h h h h sear sear ch e sc ear h e nds at cnds at h ends at a a 2 2a 2 e e e a a 2 2a 2 e e e 0 0 0 0 0 0 l l l l l l end o end o fe st nd o fr st ing rfing string a a 2 2a 2 e e e a a 2 2a 2 e e e 0 0 0 0 0 0 l l l l l l value is n value is n value is n ullull ull sear sear ch e sc ear h e nds at cnds at h ends at l l ll l l l l ll l l retr ur etn ur reshe n tur she n she nul n l link ul n l link ull link sear sear ch e sc ear h e nds at cnds at h ends at l l l l l l l l l l l l (last k (last k (last k ey on p ey on p ey on p ath) ath) ath) retr ur etn ur reshe n tur she n she s s 1 1s 1 end o end o fe st nd o fr st ing rfing string s s 1 1s 1 l l l l l l (last k (last k (last k ey on p ey on p ey on p ath) ath) ath) value is no value is no value is no t nt n ullul t n l ull s s 1 1s 1 s s 1 1s 1 l l l l l l ret r ur etn ur reshe n tur she n she s s 3 3s 3 s s s 3 3 3 sear sear ch e sc ear h e nds at cnds at h ends at s s 3 3s 3 s s 3 3s 3 nul n l link ul n l link ull link retr ur etn ur reshells n tur shells n shells (last k (last k (last k ey on p ey on p ey on p ath) ath) ath) Possibilities for longestPrefixOf() Possibilities f Possibilities f Possibilities f or or longestPrefixOf() longestPrefixOf() or longestPrefixOf() 46Longest prefix in an Rway trie: Java implementation Find longest key in symbol table that is a prefix of query string. Search for query string. Keep track of longest key encountered. public String longestPrefixOf(String query) int length = search(root, query, 0, 0); return query.substring(0, length); private int search(Node x, String query, int d, int length) if (x == null) return length; if (x.val = null) length = d; if (d == query.length()) return length; char c = query.charAt(d); return search(x.nextc, query, d+1, length); 47T9 texting Goal. Type text messages on a phone keypad. Multitap input. Enter a letter by repeatedly pressing a key. Ex. hello: 4 4 3 3 5 5 5 5 5 5 6 6 6 "a much faster and more fun way to enter text" T9 text input. Find all words that correspond to given sequence of numbers. Press 0 to see all completion options. Ex. hello: 4 3 5 5 6 Q. How to implement www.t9.com 48Patricia trie Patricia trie. Practical Algorithm to Retrieve Information Coded in Alphanumeric Remove oneway branching. Each node represents a sequence of characters. put("shells", 1); Implementation: one step beyond this course. put("shellfish", 2); standard no oneway trie branching shell s 2 s 1 fish Applications. h Database search. P2P network search. e internal oneway branching IP routing tables: find longest prefix match. l Compressed quadtree for Nbody simulation. l Efficiently storing and querying XML documents. s 1 f external i oneway branching s h 2 Also known as: critbit tree, radix tree. Removing oneway branching in a trie 51Suffix tree Suffix tree. Patricia trie of suffixes of a string. Lineartime construction: well beyond scope of this course. suffix tree for BANANAS BANANAS A NA S NA S S NAS NAS S Applications. Lineartime: longest repeated substring, longest common substring, longest palindromic substring, substring search, tandem repeats, …. Computational biology databases (BLAST, FASTA). 52String symbol tables summary A success story in algorithm design and analysis. Redblack BST. Performance guarantee: log N key compares. Supports ordered symbol table API. Hash tables. Performance guarantee: constant number of probes. Requires good hash function for key type. Tries. Rway, TST. Performance guarantee: log N characters accessed. Supports characterbased operations. Bottom line. You can get at anything by examining 50100 bits () 53
sharer
Presentations
Free
Document Information
Category:
Presentations
User Name:
Dr.BenjaminClark
User Type:
Teacher
Country:
United States
Uploaded Date:
21-07-2017