Question? Leave a message!




STRING SORTS

STRING SORTS
ROBERT SEDGEWICK KEVIN WAYNE Algorithms 5.1 STRING SORTS strings in Java ‣ keyindexed counting ‣ LSD radix sort ‣ Algorithms FOUR TH EDITION MSD radix sort ‣ 3way radix quicksort ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu suffix arrays ‣5.1 STRING SORTS strings in Java ‣ keyindexed counting ‣ LSD radix sort ‣ Algorithms MSD radix sort ‣ 3way radix quicksort ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu suffix arrays ‣String processing String. Sequence of characters. Important fundamental abstraction. Genomic sequences. Information processing. Communication systems (e.g., email). Programming systems (e.g., Java programs). … “ The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G's, A's, T's and C's. This string is the root data 0 structure of an organism's biology. ” — M. V. Olson 3The char data type C char data type. Typically an 8bit integer. Supports 7bit ASCII. 6.5 Data Compression 667 Can represent at most 256 characters. ASCII encoding. When you HexDump a bit 0 1 2 3 4 5 6 7 8 9 A B C D E F stream that contains ASCIIencoded charac NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 0 ters, the table at right is useful for reference. DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 1 Given a 2digit hex number, use the first hex SP 2 “ ‘ ( ) + , . / digit as a row index and the second hex digit 3 0 1 2 3 4 5 6 7 8 9 : ; = as a column reference to find the character 4 A B C D E F G H I J K L M N O that it encodes. For example, 31 encodes the 5 P Q R S T U V W X Y Z \ U+0041 U+00E1 U+2202 U+1D50A digit 1, 4A encodes the letter J, and so forth. 6 ` a b c d e f g h i j k l m n o This table is for 7bit ASCII, so the first hex DEL 7 p q r s t u v w x y z some Unicode characters digit must be 7 or less. Hex numbers starting Hexadecimal to ASCII conversion table with 0 and 1 (and the numbers 20 and 7F) correspond to nonprinting control charac ters. Many of the control characters are left over from the days when physical devices like typewriters were controlled by ASCII input; the table highlights a few that you Java char data type. A 16bit unsigned integer. might see in dumps. For example SP is the space character, NUL is the null character, LF is linefeed, and CR is carriagereturn. Supports original 16bit Unicode. Supports 21bit Unicode 3.0 (awkwardly). In summary, working with data compr ession requires us to reorient our thinking about standard input and standard output to include binary encoding of data. BinaryStdIn 4 and BinaryStdOut provide the methods that we need. They provide a way for you to make a clear distinction in your client programs between writing out information in tended for file storage and data transmission (that will be read by programs) and print ing information (that is likely to be read by humans). I ♥︎ Unicode U+0041 5The String data type String data type in Java. Immutable sequence of characters. Length. Number of characters. th Indexing. Get the i character. Concatenation. Concatenate one string to the end of another. s.length() 0 1 2 3 4 5 6 7 8 9 10 11 12 s A T T A C K A T D A W N s.charAt(3) s.substring(7, 11) Fundamental constanttime String operations 6The String data type: immutability Q. Why immutable A. All the usual reasons. Can use as keys in symbol table. Don't need to defensively copy. Ensures consistent state. Supports concurrency. public class FileInputStream Improves security. private String filename; public FileInputStream(String filename) if (allowedToReadFile(filename)) throw new SecurityException(); this.filename = filename; ... attacker could bypass security if string type were mutable 7The String data type: representation Representation (Java 7). Immutable char array + cache of hash. operation Java running time s.length() length 1 s.charAt(i) indexing 1 s + t concatenation M + N ⋮ ⋮ 8String performance trap Q. How to build a long string, one character at a time public static String reverse(String s) String rev = ""; for (int i = s.length() 1; i = 0; i) rev += s.charAt(i); quadratic time return rev; A. Use StringBuilder data type (mutable char array). public static String reverse(String s) StringBuilder rev = new StringBuilder(); for (int i = s.length() 1; i = 0; i) rev.append(s.charAt(i)); linear time return rev.toString(); 9Comparing two strings Q. How many character compares to compare two strings of length W p r e f e t c h 0 1 2 3 4 5 6 7 p r e f i x e s Running time. Proportional to length of longest common prefix. Proportional to W in the worst case. But, often sublinear in W. 10Alphabets Digital key. Sequence of digits over fixed alphabet. 604 CHAPTER 6 Strings Radix. Number of digits R in alphabet. name R() lgR() characters BINARY 2 1 01 OCTAL 8 3 01234567 DECIMAL 10 4 0123456789 HEXADECIMAL 16 4 0123456789ABCDEF DNA 4 2 ACTG LOWERCASE 26 5 abcdefghijklmnopqrstuvwxyz UPPERCASE 26 5 ABCDEFGHIJKLMNOPQRSTUVWXYZ PROTEIN 20 5 ACDEFGHIKLMNPQRSTVWY ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef BASE64 64 6 ghijklmnopqrstuvwxyz0123456789+/ ASCII 128 7 ASCII characters EXTENDEDASCII 256 8 extended ASCII characters UNICODE16 65536 16 Unicode characters Standard alphabets 11 holds the frequencies in Count is an example of a characterindexed array. With a Java String, we have to use an array of size 256; with Alphabet, we just need an array with one entry for each alphabet character. This savings might seem modest, but, as you will see, our algorithms can produce huge numbers of such arrays, and the space for arrays of size 256 can be prohibitive. Numbers. As you can see from our several of the standard Alphabet examples, we of ten represent numbers as strings. The methods toIndices() coverts any String over a given Alphabet into a baseR number represented as an int array with all values between 0 and R1. In some situations, doing this conversion at the start leads to com pact code, because any digit can be used as an index in a characterindexed array. For example, if we know that the input consists only of characters from the alphabet, we could replace the inner loop in Count with the more compact code int a = alpha.toIndices(s); for (int i = 0; i N; i++) countai++;5.1 STRING SORTS strings in Java ‣ keyindexed counting ‣ LSD radix sort ‣ Algorithms MSD radix sort ‣ 3way radix quicksort ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu suffix arrays ‣Review: summary of the performance of sorting algorithms Frequency of operations. algorithm guarantee random extra space stable operations on keys 2 2 compareTo() ✔ insertion sort ½ N ¼ N 1 compareTo() ✔ mergesort N lg N N lg N N compareTo() quicksort 1.39 N lg N 1.39 N lg N c lg N compareTo() heapsort 2 N lg N 2 N lg N 1 probabilistic Lower bound. N lg N compares required by any comparebased algorithm. Q. Can we do better (despite the lower bound) use array accesses to make Rway decisions A. Yes, if we don't depend on key compares. (instead of binary decisions) 13Keyindexed counting: assumptions about keys Assumption. Keys are integers between 0 and R 1. Implication. Can use key as an array index. input sorted result name section (by section) Anderson 2 Harris 1 Applications. Brown 3 Martin 1 Davis 3 Moore 1 Sort string by first letter. Garcia 4 Anderson 2 Harris 1 Martinez 2 Sort class roster by section. Jackson 3 Miller 2 Johnson 4 Robinson 2 Sort phone numbers by area code. Jones 3 White 2 Martin 1 Brown 3 Subroutine in a sorting algorithm. stay tuned Martinez 2 Davis 3 Miller 2 Jackson 3 Moore 1 Jones 3 Remark. Keys may have associated data ⇒ Robinson 2 Taylor 3 Smith 4 Williams 3 can't just count up number of keys of each value. Taylor 3 Garcia 4 Thomas 4 Johnson 4 Thompson 4 Smith 4 White 2 Thomas 4 Williams 3 Thompson 4 Wilson 4 Wilson 4 keys are small integers Typical candidate for keyindexed counting 14Keyindexed counting demo Goal. Sort an array a of N integers between 0 and R 1. R = 6 Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array. i ai 0 d int N = a.length; 1 a int count = new intR+1; use for a 0 2 c b for 1 3 f for (int i = 0; i N; i++) c for 2 4 f countai+1++; d for 3 e for 4 5 b f for 5 for (int r = 0; r R; r++) 6 d countr+1 += countr; 7 b 8 f for (int i = 0; i N; i++) 9 b auxcountai++ = ai; 10 e 11 a for (int i = 0; i N; i++) ai = auxi; 15Keyindexed counting demo Goal. Sort an array a of N integers between 0 and R 1. Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array. offset by 1 i ai stay tuned 0 d int N = a.length; 1 a int count = new intR+1; r countr 2 c 3 f a 0 for (int i = 0; i N; i++) count frequencies 4 f countai+1++; b 2 5 b c 3 for (int r = 0; r R; r++) 6 d d 1 countr+1 += countr; 7 b e 2 8 f f 1 for (int i = 0; i N; i++) 9 b 3 auxcountai++ = ai; 10 e 11 a for (int i = 0; i N; i++) ai = auxi; 16Keyindexed counting demo Goal. Sort an array a of N integers between 0 and R 1. Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array. i ai 0 d int N = a.length; 1 a int count = new intR+1; r countr 2 c 3 f a 0 for (int i = 0; i N; i++) 4 f countai+1++; b 2 5 b c 5 for (int r = 0; r R; r++) 6 d d 6 compute cumulates countr+1 += countr; 7 b e 8 8 f f 9 for (int i = 0; i N; i++) 9 b 12 auxcountai++ = ai; 10 e 11 a for (int i = 0; i N; i++) 6 keys d, 8 keys e ai = auxi; so d’s go in a6 and a7 17Keyindexed counting demo Goal. Sort an array a of N integers between 0 and R 1. Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array. i ai i auxi 0 a 0 d int N = a.length; 1 a 1 a int count = new intR+1; r countr 2 b 2 c 3 b 3 f a 2 for (int i = 0; i N; i++) 4 b 4 f countai+1++; b 5 5 c 5 b c 6 for (int r = 0; r R; r++) 6 6 d d d 8 countr+1 += countr; 7 d 7 b e 9 8 e 8 f f 12 for (int i = 0; i N; i++) move 9 f 9 b 12 items auxcountai++ = ai; 10 f 10 e 11 f 11 a for (int i = 0; i N; i++) ai = auxi; 18Keyindexed counting demo Goal. Sort an array a of N integers between 0 and R 1. Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array. i ai i auxi 0 a 0 a int N = a.length; 1 a 1 a int count = new intR+1; r countr 2 b 2 b 3 b 3 b a 2 for (int i = 0; i N; i++) 4 b 4 b countai+1++; b 5 5 c 5 c c 6 for (int r = 0; r R; r++) 6 6 d d d 8 countr+1 += countr; 7 d 7 d e 9 8 e 8 e f 12 for (int i = 0; i N; i++) 9 f 9 f 12 auxcountai++ = ai; 10 f 10 f 11 f 11 f for (int i = 0; i N; i++) copy back ai = auxi; 19Keyindexed counting: analysis Proposition. Keyindexed takes time proportional to N + R. for (int i = 0; i N; i++) auxcountai.key(d)++ = ai; Proposition. Keyindexed counting uses extra space proportional to N + R. count 1 2 3 4 0 3 8 14 Stable ✔ a0 aux0 Anderson 2 Harris 1 0 4 8 14 a1 aux1 Brown 3 Martin 1 0 4 9 14 a2 aux2 Davis 3 Moore 1 0 4 10 14 a3 aux3 Garcia 4 Anderson 2 0 4 10 15 aux4 a4 Harris 1 Martinez 2 1 4 10 15 a5 aux5 Jackson 3 Miller 2 1 4 11 15 aux6 a6 Johnson 4 Robinson 2 1 4 11 16 a7 aux7 Jones 3 White 2 1 4 12 16 a8 aux8 Martin 1 Brown 3 2 4 12 16 a9 aux9 Martinez 2 Davis 3 2 5 12 16 a10 aux10 Miller 2 Jackson 3 2 6 12 16 a11 aux11 Moore 1 Jones 3 3 6 12 16 a12 aux12 Robinson 2 Taylor 3 3 7 12 16 a13 aux13 Smith 4 Williams 3 3 7 12 17 a14 aux14 Taylor 3 Garcia 4 3 7 13 17 a15 aux15 Thomas 4 Johnson 4 3 7 13 18 a16 aux16 Thompson 4 Smith 4 3 7 13 19 a17 aux17 White 2 Thomas 4 3 8 13 19 a18 aux18 Williams 3 Thompson 4 3 8 14 19 a19 aux19 Wilson 4 Wilson 4 3 8 14 20 Distributing the data (records with key 3 highlighted) 205.1 STRING SORTS strings in Java ‣ keyindexed counting ‣ LSD radix sort ‣ Algorithms MSD radix sort ‣ 3way radix quicksort ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu suffix arrays ‣Leastsignificantdigitfirst string sort LSD string (radix) sort. Consider characters from right to left. th Stably sort using d character as the key (using keyindexed counting). sort key (d = 2) sort key (d = 1) sort key (d = 0) 0 0 0 0 d a b d a b d a b a c e 1 a d d 1 c a b 1 c a b 1 a d d 2 c a b 2 e b b 2 f a d 2 b a d 3 f a d 3 a d d 3 b a d 3 b e d 4 f e e 4 f a d 4 d a d 4 b e e 5 b a d 5 b a d 5 e b b 5 c a b 6 6 6 6 d a d d a d a c e d a b 7 b e e 7 f e d 7 a d d 7 d a d 8 f e d 8 b e d 8 f e d 8 e b b 9 b e d 9 f e e 9 b e d 9 f a d 10 e b b 10 b e e 10 f e e 10 f e d 11 a c e 11 a c e 11 b e e 11 f e e sort is stable (arrows do not cross) 22LSD string sort: correctness proof Proposition. LSD sorts fixedlength strings in ascending order. sort key Pf. by induction on i 0 0 d a b a c e After pass i, strings are sorted by last i characters. 1 c a b 1 a d d If two strings differ on sort key, 2 f a d 2 b a d keyindexed sort puts them in proper 3 b a d 3 b e d relative order. 4 d a d 4 b e e 5 e b b 5 c a b If two strings agree on sort key, 6 6 a c e d a b stability keeps them in proper relative order. 7 a d d 7 d a d 8 f e d 8 e b b 9 b e d 9 f a d 10 f e e 10 f e d 11 b e e 11 f e e Proposition. LSD sort is stable. sorted from Pf. Keyindexed counting is stable. previous passes (by induction) 23LSD string sort: Java implementation public class LSD fixedlength W strings public static void sort(String a, int W) radix R int R = 256; int N = a.length; String aux = new StringN; do keyindexed counting for (int d = W1; d = 0; d) for each digit from right to left int count = new intR+1; for (int i = 0; i N; i++) countai.charAt(d) + 1++; keyindexed counting for (int r = 0; r R; r++) countr+1 += countr; for (int i = 0; i N; i++) auxcountai.charAt(d)++ = ai; for (int i = 0; i N; i++) ai = auxi; 24Summary of the performance of sorting algorithms Frequency of operations. algorithm guarantee random extra space stable operations on keys 2 2 compareTo() ✔ insertion sort ½ N ¼ N 1 compareTo() ✔ mergesort N lg N N lg N N compareTo() quicksort 1.39 N lg N 1.39 N lg N c lg N compareTo() heapsort 2 N lg N 2 N lg N 1 † charAt() ✔ LSD sort 2 W (N + R) 2 W (N + R) N + R probabilistic † fixedlength W keys Q. What if strings are not all of same length 25String sorting interview question Problem. Sort one million 32bit integers. Ex. Google (or presidential) interview. Which sorting method to use Insertion sort. Mergesort. Quicksort. Heapsort. LSD string sort. 26String sorting interview question Google CEO Eric Schmidt interviews Barack Obama 27How to take a census in 1900s 1880 Census. Took 1500 people 7 years to manually process data. Herman Hollerith. Developed counting and sorting machine to automate. Use punch cards to record data (e.g., gender, age). Machine sorts one column at a time (into one of 12 bins). Typical question: how many women of age 20 to 30 Hollerith tabulating machine and sorter punch card (12 holes per column) 1890 Census. Finished in 1 year (and under budget) 28How to get rich sorting in 1900s Punch cards. 1900s to 1950s Also useful for accounting, inventory, and business processes. Primary medium for data entry, storage, and processing. Hollerith's company later merged with 3 others to form Computing Tabulating Recording Corporation (CTRC); company renamed in 1924. IBM 80 Series Card Sorter (650 cards per minute) 29LSD string sort: a moment in history (1960s) card punch punched cards card reader mainframe line printer not directly related To sort a card deck to sorting start on right column put cards into hopper machine distributes into bins pick up cards (stable) move left one column continue until sorted card sorter Lysergic Acid Diethylamide (Lucy in the Sky with Diamonds) 305.1 STRING SORTS strings in Java ‣ keyindexed counting ‣ LSD radix sort ‣ Algorithms MSD radix sort ‣ 3way radix quicksort ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu suffix arrays ‣Reverse LSD Consider characters from left to right. th Stably sort using d character as the key (using keyindexed counting). sort key (d = 0) sort key (d = 1) sort key (d = 2) 0 0 0 0 d a b a d d b a d c a b 1 a d d 1 a c e 1 c a b 1 d a b 2 c a b 2 b a d 2 d a b 2 e b b 3 f a d 3 b e e 3 d a d 3 b a d 4 f e e 4 b e d 4 f a d 4 d a d 5 b a d 5 c a b 5 e b b 5 f a d 6 6 6 6 d a d d a b a c e a d d 7 b e e 7 d a d 7 a d d 7 b e d 8 f e d 8 e b b 8 b e e 8 f e d 9 b e d 9 f a d 9 b e d 9 a c e 10 e b b 10 f e e 10 f e e 10 b e e 11 a c e 11 f e d 11 f e d 11 f e e not sorted 32Mostsignificantdigitfirst string sort MSD string (radix) sort. Partition array into R pieces according to first character (use keyindexed counting). Recursively sort all strings that start with each character (keyindexed counts delineate subarrays to sort). 0 a d d 1 a c e 0 0 d a b a d d count 1 a d d 1 a c e 2 b a d a 0 2 c a b 2 b a d 3 b e e b 2 3 f a d 3 b e e 4 b e d c 5 4 f e e 4 b e d sort subarrays d 6 5 b a d 5 c a b 5 c a b recursively e 8 6 6 d a d d a b f 6 d a b 9 7 b e e 7 d a d 7 d a d 12 8 f e d 8 e b b 9 b e d 9 f a d 8 e b b 10 e b b 10 f e e 9 f a d 11 a c e 11 f e d f e e 10 sort key 11 f e d 33MSD string sort: example input d are are are are are are are are she by by by by by by by by sells lo she sells seashells sea sea sea seas sea seashells sells seashells sea seashells seashells seashells seashells seashells by seashells sea seashells seashells seashells seashells seashells seashells the sea sells sells sells sells sells sells sells sea shore seashells sells sells sells sells sells sells shore shells she she she she she she she the she shore shore shore shore shore shore shore shells sells shells shells shells shells shells shore shells she surely she she she she she she she sells seashells surely surely surely surely surely surely surely are the the the the the the the the surely hi the the the the the the the the seashells end of string need to examine every character goes before any output in equal keys char value are are are are are are are are by by by by by by by by sea sea sea sea sea sea sea sea seashellsseashells seashells seashells seashells seashells seashells seashells seashellsseashells seashells seashells seashells seashells seashells seashells sells sells sells sells sells sells sells sells sells sells sells sells sells sells sells sells she she she she she she she she shore sshore shore shells she she she she shells shells hells shells she shells shells shells she she she shore shore shore shore shore surely surely surely surely surely surely surely surely the the the the the the the the the the the the the the the the Trace of recursive calls for MSD string sort (no cutoff for small subarrays, subarrays of size 0 and 1 omitted) 34Variablelength strings Treat strings as if they had an extra char at end (smaller than any char). why smaller 0 s e a 1 1 s e a s h e l l s 1 2 s e l l s 1 3 s h e 1 she before shells 4 s h e 1 5 s h e l l s 1 6 s h o r e 1 7 s u r e l y 1 private static int charAt(String s, int d) if (d s.length()) return s.charAt(d); else return 1; C strings. Have extra char '\0' at end ⇒ no extra work needed. 35MSD string sort: Java implementation public static void sort(String a) aux = new Stringa.length; recycles aux array sort(a, aux, 0, a.length 1, 0); but not count array private static void sort(String a, String aux, int lo, int hi, int d) if (hi = lo) return; keyindexed counting int count = new intR+2; for (int i = lo; i = hi; i++) countcharAt(ai, d) + 2++; for (int r = 0; r R+1; r++) countr+1 += countr; for (int i = lo; i = hi; i++) auxcountcharAt(ai, d) + 1++ = ai; for (int i = lo; i = hi; i++) ai = auxi lo; sort R subarrays recursively for (int r = 0; r R; r++) sort(a, aux, lo + countr, lo + countr+1 1, d+1); 36MSD string sort: potential for disastrous performance count Observation 1. Much too slow for small subarrays. Each function call needs its own count array. ASCII (256 counts): 100x slower than copy pass for N = 2. Unicode (65,536 counts): 32,000x slower for N = 2. Observation 2. Huge number of small subarrays because of recursion. a aux 0 b 0 a 1 a 1 b 37Cutoff to insertion sort Solution. Cutoff to insertion sort for small subarrays. th Insertion sort, but start at d character. private static void sort(String a, int lo, int hi, int d) for (int i = lo; i = hi; i++) for (int j = i; j lo less(aj, aj1, d); j) exch(a, j, j1); th Implement less() so that it compares starting at d character. private static boolean less(String v, String w, int d) for (int i = d; i Math.min(v.length(), w.length()); i++) if (v.charAt(i) w.charAt(i)) return true; if (v.charAt(i) w.charAt(i)) return false; return v.length() w.length(); 38 MSD string sort: performance Number of characters examined. MSD examines just enough characters to sort the keys. Number of characters examined depends on keys. Can be sublinear in input size Nonrandom compareTo() based sorts Random Worst case with duplicates (sublinear) (linear) can also be sublinear (nearly linear) are 1EIO402 1DNB377 by 1HYL490 1DNB377 sea 1ROZ572 1DNB377 seashells 1DNB377 2HXE734 seashells 1DNB377 2IYE230 sells 1DNB377 2XOR846 sells 1DNB377 3CDB573 she 3CVP720 1DNB377 she 3IGJ319 1DNB377 shells 3KNA382 1DNB377 shore 3TAV879 1DNB377 surely 4CQP781 1DNB377 the 4QGI284 1DNB377 the 4YHV229 1DNB377 Characters examined by MSD string sort 39Summary of the performance of sorting algorithms Frequency of operations. algorithm guarantee random extra space stable operations on keys 2 2 compareTo() ✔ insertion sort ½ N ¼ N 1 compareTo() ✔ mergesort N lg N N lg N N compareTo() quicksort 1.39 N lg N 1.39 N lg N c lg N compareTo() heapsort 2 N lg N 2 N lg N 1 † charAt() ✔ LSD sort 2 W (N + R) 2 W (N + R) N + R ‡ charAt() ✔ MSD sort 2 W (N + R) N log N N + D R R probabilistic D = functioncall stack depth † fixedlength W keys (length of longest prefix match) ‡ averagelength W keys 40MSD string sort vs. quicksort for strings Disadvantages of MSD string sort. Extra space for aux. Extra space for count. Inner loop has a lot of instructions. Accesses memory "randomly" (cache inefficient). Disadvantage of quicksort. Linearithmic number of string compares (not linear). Has to rescan many characters in keys with long prefix matches. doesn't rescan tight inner loop, characters cache friendly Goal. Combine advantages of MSD and quicksort. 41Engineering a radix sort (American flag sort) Optimization 0. Cutoff to insertion sort. Optimization 1. Replace recursion with explicit stack. Push subarrays to be sorted onto stack. Now, one count array suffices. Optimization 2. Do Rway partitioning in place. Radix Engineering Sort Bostic Peter M. Mcllroy and Keith Eliminates aux array. University of California at Berkeley; and M. Douglas Mcllroy Sacrifices stability. ATT Bell Laboratories have excellent ABSTRACT Radix sorting methods asymptotic performance on string data, for which com parison operation. Attractive for use is not a unittime in large byteaddressable memories, these methods long been eclipsed by more easily have nevertheless prograÍrmed algorithms. Three ways to sort strings by righta stable list sort, a stable twoarray bytes left to "American illus sort, and an inplace flag" sor¿are with practical C programs. For heavyduty sort trated perform usually running at ing, all three comparably, quicksort. least twice as fast as a good We recommend American national flag problem Dutch national flag problem for general use. American flag sort 42 . . Systems, Vol. 6 No. 1 Winter 1993 Computing 5.1 STRING SORTS strings in Java ‣ keyindexed counting ‣ LSD radix sort ‣ Algorithms MSD radix sort ‣ 3way radix quicksort ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu suffix arrays ‣3way string quicksort (Bentley and Sedgewick, 1997) th Overview. Do 3way partitioning on the d character. Less overhead than Rway partitioning in MSD string sort. Does not reexamine characters equal to the partitioning char. (but does reexamine characters not equal to the partitioning char) she by partitioning item sells are use first character to seashells seashells partition into "less", "equal", and "greater" by she subarrays the seashells recursively sort subarrays, sea sea excluding first character for middle subarray shore shore the surely shells shells she she sells sells are sells surely the seashells the 443way string quicksort: trace of recursive calls partitioning item she by are are are sells are by by by seashells seashells seashells seashells seashells by she she sells sea the seashells seashells seashells seashells sea sea sea sea sells shore shore shore sells sells the surely surely shells shells shells shells shells she she she she she surely surely sells sells sells shore shore are sells sells she she surely the the the the seashells the the the the Trace of first few recursive calls for 3way string quicksort (subarrays of size 1 not shown) 453way string quicksort: Java implementation private static void sort(String a) sort(a, 0, a.length 1, 0); private static void sort(String a, int lo, int hi, int d) if (hi = lo) return; 3way partitioning th (using d character) int lt = lo, gt = hi; int v = charAt(alo, d); int i = lo + 1; while (i = gt) to handle variablelength strings int t = charAt(ai, d); if (t v) exch(a, lt++, i++); else if (t v) exch(a, i, gt); else i++; sort(a, lo, lt1, d); sort 3 subarrays recursively if (v = 0) sort(a, lt, gt, d+1); sort(a, gt+1, hi, d); 463way string quicksort vs. standard quicksort Standard quicksort. Uses 2 N ln N string compares on average. Costly for keys with long common prefixes (and this is a common case) 3way string (radix) quicksort. Uses 2 N ln N character compares on average for random strings. Avoids recomparing long common prefixes. Fast Algorithms for Sorting and Searching Strings Jon L. Bentley Robert Sedgewick Abstract that is competitive with the most efficient string sorting programs known. The second program is a symbol table We present theoretical algorithms for sorting and implementation that is faster than hashing, which is com searching multikey data, and derive from them practical C monly regarded as the fastest symbol table implementa implementations for applications in which keys are charac tion. The symbol table implementation is much more ter strings. The sorting algorithm blends Quicksort and spaceefficient than multiway trees, and supports more radix sort; it is competitive with the best known C sort advanced searches. codes. The searching algorithm blends tries and binary search trees; it is faster than hashing and other commonly In many application programs, sorts use a Quicksort used search methods. The basic ideas behind the algo implementation based on an abstract compare operation, rithms date back at least to the 1960s but their practical and searches use hashing or binary search trees. These do utility has been overlooked. We also present extensions to not take advantage of the properties of string keys, which more complex string problems, such as partialmatch are widely used in practice. Our algorithms provide a nat 47 searching. ural and elegant way to adapt classical algorithms to this important class of applications. 1. Introduction Section 6 turns to more difficult stringsearching prob lems. Partialmatch queries allow “don’t care” characters Section 2 briefly reviews Hoare’s 9 Quicksort and (the pattern “so.a”, for instance, matches soda and sofa). binary search trees. We emphasize a wellknown isomor The primary result in this section is a ternary search tree phism relating the two, and summarize other basic facts. implementation of Rivest’s partialmatch searching algo The multikey algorithms and data structures are pre rithm, and experiments on its performance. “Near neigh sented in Section 3. Multikey Quicksort orders a set of II bor” queries locate all words within a given Hamming dis vectors with k components each. Like regular Quicksort, it tance of a query word (for instance, code is distance 2 partitions its input into sets less than and greater than a from soda). We give a new algorithm for near neighbor given value; like radix sort, it moves on to the next field searching in strings, present a simple C implementation, once the current input is known to be equal in the given and describe experiments on its efficiency. field. A node in a ternary search tree represents a subset of Conclusions are offered in Section 7. vectors with a partitioning value and three pointers: one to lesser elements and one to greater elements (as in a binary 2. Background search tree) and one to equal elements, which are then pro cessed on later fields (as in tries). Many of the structures Quicksort is a textbook divideandconquer algorithm. and analyses have appeared in previous work, but typically To sort an array, choose a partitioning element, permute as complex theoretical constructions, far removed from the elements such that lesser elements are on one side and practical applications. Our simple framework opens the greater elements are on the other, and then recursively sort door for later implementations. the two subarrays. But what happens to elements equal to the partitioning value Hoare’s partitioning method is The algorithms are analyzed in Section 4. Many of the binary: it places lesser elements on the left and greater ele analyses are simple derivations of old results. ments on the right, but equal elements may appear on Section 5 describes efficient C programs derived from either side. the algorithms. The first program is a sorting algorithm Algorithm designers have long recognized the desir irbility and difficulty of a ternary partitioning method. Bell Labs, Lucent Technologies, 700 Mountam Avenue, Murray Hill. NJ 07974; jlbresearch.belllabs.com. Sedgewick 22 observes on page 244: “Ideally, we would Princeton University. Princeron. NJ. 08514: rscs.princeton.edu. llke to get all equal keys1 into position in the file, with all 360 3way string quicksort vs. MSD string sort MSD string sort. Is cacheinefficient. Too much memory storing count. Too much overhead reinitializing count and aux. 3way string quicksort. Is cachefriendly. Is inplace. Has a short inner loop. library of Congress call numbers Bottom line. 3way string quicksort is method of choice for sorting strings. 48Summary of the performance of sorting algorithms Frequency of operations. algorithm guarantee random extra space stable operations on keys 2 2 ✔ compareTo() insertion sort ½ N ¼ N 1 ✔ compareTo() mergesort N lg N N lg N N compareTo() quicksort 1.39 N lg N 1.39 N lg N c lg N compareTo() heapsort 2 N lg N 2 N lg N 1 † ✔ charAt() LSD sort 2 W (N + R) 2 W (N + R) N + R ‡ ✔ charAt() MSD sort 2 W (N + R) N log N N + D R R 3way string charAt() 1.39 W N lg R 1.39 N lg N log N + W quicksort probabilistic † fixedlength W keys ‡ averagelength W keys 495.1 STRING SORTS strings in Java ‣ keyindexed counting ‣ LSD radix sort ‣ Algorithms MSD radix sort ‣ 3way radix quicksort ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu suffix arrays ‣Keywordincontext search Given a text of N characters, preprocess it to enable fast substring search (find all occurrences of query string context). more tale.txt it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness it was the epoch of belief it was the epoch of incredulity it was the season of light it was the season of darkness it was the spring of hope it was the winter of despair ⋮ Applications. Linguistics, databases, web search, word processing, …. 51Keywordincontext search Given a text of N characters, preprocess it to enable fast substring search (find all occurrences of query string context). characters of java KWIC tale.txt 15 surrounding context search o st giless to search for contraband her unavailing search for your fathe le and gone in search of her husband t provinces in search of impoverishe dispersing in search of other carri n that bed and search the straw hold better thing t is a far far better thing that i do than some sense of better things else forgotte was capable of better things mr carton ent Applications. Linguistics, databases, web search, word processing, …. 52Suffix sort input string i t w a s b e s t i t w a s w 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 form suffixes sort suffixes to bring query strings together 0 3 i t w a s b e s t i t w a s w a s b e s t 1 12 t w a s b e s t i t w a s w a s w 2 5 w a s b e s t i t w a s w b e s t i t w a s w 3 6 a s b e s t i t w a s w e s t i t w a s w 4 0 s b e s t i t w a s w i t w a s b e s t i t w a s w 5 9 b e s t i t w a s w i t w a s w 6 4 e s t i t w a s w s b e s t i t w a s w 7 7 s t i t w a s w s t i t w a s w 8 13 t i t w a s w s w 9 8 i t w a s w t i t w a s w 10 1 t w a s w t w a s b e s t i t w a s w 11 10 w a s w t w a s w 12 14 a s w w 13 2 s w w a s b e s t i t w a s w 14 11 w w a s w array of suffix indices in sorted order 53Keywordincontext search: suffixsorting solution Preprocess: suffix sort the text. Query: binary search for query; scan until mismatch. KWIC search for "search" in Tale of Two Cities ⋮ 632698 s e a l e d m y l e t t e r a n d … 713727 s e a m s t r e s s i s l i f t e d … 660598 s e a m s t r e s s o f t w e n t y … 67610 s e a m s t r e s s w h o w a s w i … 4430 s e a r c h f o r c o n t r a b a n d … 42705 s e a r c h f o r y o u r f a t h e … 499797 s e a r c h o f h e r h u s b a n d … 182045 s e a r c h o f i m p o v e r i s h e … 143399 s e a r c h o f o t h e r c a r r i … 411801 s e a r c h t h e s t r a w h o l d … 158410 s e a r e d m a r k i n g a b o u t … 691536 s e a s a n d m a d a m e d e f a r … 536569 s e a s e a t e r r i b l e p a s s … 484763 s e a s e t h a t h a d b r o u g h … ⋮ 54War story Q. How to efficiently form (and sort) suffixes String suffixes = new StringN; for (int i = 0; i N; i++) suffixesi = s.substring(i, N); Algorithms FOUR TH EDITION Arrays.sort(suffixes); ROBERT SEDGEWICK KEVIN WAYNE rd 3 printing input file characters Java 7u4 Java 7u5 amendments.txt 18 thousand 0.25 sec 2.0 sec aesop.txt 192 thousand 1.0 sec out of memory mobydick.txt 1.2 million 7.6 sec out of memory chromosome11.txt 7.1 million 61 sec out of memory 55The String data type: Java 7u5 implementation public final class String implements ComparableString private char value; // characters private int offset; // index of first char in array private int length; // length of string private int hash; // cache of hashCode() … length = 12 String s = "Hello, World" value H E L L O , W O R L D 0 1 2 3 4 5 6 7 8 9 10 11 offset = 0 length = 5 String t = s.substring(7, 12); value H E L L O , W O R L D 0 1 2 3 4 5 6 7 8 9 10 11 offset = 7 56The String data type: Java 7u6 implementation public final class String implements ComparableString private char value; // characters private int hash; // cache of hashCode() … String s = "Hello, World" value H E L L O , W O R L D 0 1 2 3 4 5 6 7 8 9 10 11 String t = s.substring(7, 12); value W O R L D 0 1 2 3 4 57The String data type: performance String data type (in Java). Sequence of characters (immutable). Java 7u5. Immutable char array, offset, length, hash cache. Java 7u6. Immutable char array, hash cache. operation Java 7u5 Java 7u6 length 1 1 indexing 1 1 substring extraction 1 N concatenation M + N M + N ✔ ✔ immutable memory 64 + 2N 56 + 2N 58A Reddit exchange I'm the author of the substring() change. As has been suggested in the analysis here there were two motivations for the change Reduce the size of String instances. Strings • are typically 2040 of common apps footprint. bondolo Avoid memory leakage caused by retained • substrings holding the entire character array. Changing this function, in a bugfix release no less, was totally irresponsible. It broke backwards compatibility for numerous applications with errors that didn't even produce a message, just freezing and timeouts... All pain, no gain. Your work was not just vain, it was thoroughly destructive, even cypherpunks beyond its immediate effect. http://www.reddit.com/r/programming/comments/1qw73v/tiloraclechangedtheinternalstring 59Suffix sort Q. How to efficiently form (and sort) suffixes in Java 7u6 A. Define Suffix class ala Java 7u5 String. public class Suffix implements ComparableSuffix private final String text; private final int offset; public Suffix(String s, int offset) this.text = text; this.offset = offset; public int length() return text.length() offset; public char charAt(int i) return text.charAt(offset + i); public int compareTo(Suffix that) / see textbook / H E L L O , W O R L D text 0 1 2 3 4 5 6 7 8 9 10 11 offset 60Suffix sort Q. How to efficiently form (and sort) suffixes in Java 7u6 A. Define Suffix class ala Java 7u5 String. String suffixes = new StringN; for (int i = 0; i N; i++) suffixesi = new Suffix(s, i); Algorithms FOUR TH EDITION Arrays.sort(suffixes); ROBERT SEDGEWICK KEVIN WAYNE th 4 printing 61Lessons learned Lesson 1. Put performance guarantees in API. Lesson 2. If API has no performance guarantees, don't rely upon any Corollary. May want to avoid String data type for huge strings. Are you sure charAt() and length() take constant time If lots of calls to charAt(), overhead for function calls is large. If lots of small strings, memory overhead of String is large. Ex. Our optimized algorithm for suffix arrays is 5x faster and uses 32x less memory than our original solution in Java 7u5 62Suffix Arrays: theory Q. What is worstcase running time of our suffix arrays algorithm Quadratic. Linearithmic. Linear. 2 N log N None of the above. suffixes 0 a a a a a a a a a a 1 a a a a a a a a a 2 a a a a a a a a 3 a a a a a a a 4 a a a a a a 5 a a a a a 6 a a a a 7 a a a 8 a a 9 a 63Suffix Arrays: theory Q. What is complexity of suffix arrays Quadratic. ManberMyers algorithm (see video) Linearithmic. suffix trees (beyond our scope) Linear. ✓ Nobody knows. Suffix arrays: A new method for online string searches 1 UdiManber 2 Gene Myers LINEAR PATTERN MATCHING ALGORITHMS Department ofComputerScience UniversityofArizona Peter Weiner Tucson,AZ85721 The Rand Corporation, Santa Monica, California May1989 RevisedAugust1991 Abstract Abstract In 1970, Knuth, Pratt, and Morris 1 showed how to do basic pattern matching in linear time. Related problems, such as those discussed in 4, have pre viously been solved by efficient but suboptimal algorithms. In this paper, we A new and conceptually simple data structure, called a suffix array, for online string searches is intro introduce an interesting data structure called a bitree. A linear time algo duced in this paper. Constructing and querying suffix arrays is reduced to a sort and search paradigm that rithm "for obtaining a compacted version of a bitree associated with a given employs novel algorithms. The main advantage of suffix arrays over suffix trees is that, in practice, they string is presented. With this construction as the basic tool, we indicate how use three to five times less space. From a complexity standpoint, suffix arrays permit online string to solve several pattern matching problems, including some from 4, in linear time. searches of the type, ‘‘Is W a substring of A’’ to be answered in time O(P + log N), where P is the length of W and N is the length of A, which is competitive with (and in some cases slightly better than) suffix trees. The only drawback is that in those instances where the underlying alphabet is finite and small, Before giving a formal definition of a bitree, we re I. Introduction view basic definitions and terminology concerning tary suffix trees can be constructed in O(N) time in the worst case, versus O(N log N) time for suffix arrays. trees. (See Knuth 7 for further details.) In 1970, Knuth, Morris, and Pratt 12 showed how to However, we give an augmented algorithm that, regardless of the alphabet size, constructs suffix arrays in A tary tpee T over = al,... ,a is a set of match a given pattern into another given string in time t O(N) expected time, albeit with lesser space efficiency. We believe that suffix arrays will prove to be proportional to the sum of the lengths of the pattern nodes N which is either empty or consists of a poot, better in practice than suffix trees for many applications. and string. Their algorithm was derived from a result nO EN, and t ordered, disjoint tarY trees. of Cook 3 that the 2way deterministic pushdown lan Clearly, every node n E N is the root of some i guages are recognizable on a random access machine in i tary tree T which itself consists of n1 and t ordered, time O(n). Since 1970, attention has been given to 1. Introduction iii 64 several related problems in pattern matching 46, but disjoint tary trees, say T , T , T • We call the Finding all instances of a string W in a large text A is an important pattern matching problem. There are l 2 t the algorithms developed in these investigations us iii tree T a subtpee of T; also,.all subtrees of T are many applications in which a fixed text is queried many times. In these cases, it is worthwhile to construct ually run in time which is slightly worse than linear, j j 1 for example O(n log n). It is of considerable interest a data structure to allow fast queries. The Suffix tree is a data structure that admits efficient online string considered to be subtrees of T • It is natural to to either establish that there exists a nonlinear associate with a tree Ta successor function searches. A suffix tree for a text A of length N over an alphabet can be built in O(N log ) time and lower bound on the run time of all algorithms which O(N) space Wei73, McC76. Suffix trees permit online string searches of the type, ‘‘Is W a substring of solve a given pattern matching problem, or to exhibit S: NX (Nn ) UNIL O A’’ to be answered in O(P log ) time, where P is the length of W. We explicitly consider the an algorithm whose run time is of O(n). In the following sections, we introduce an inter defined for n ENand a E L by i j 1 esting data structure, called a bitree, and show how Supported in part by an NSF Presidential Young Investigator Award (grant DCR8451397), with matching funds from ATT, and an efficient calculation of a bitree can be applied to byanNSF grantCCR9002351. i n , the root of if is nonempty the lineartime (and linearspace) solution of several 2 SupportedinpartbytheNIH(grantR01LM0496001),andbyanNSF grantCCR9002351. pattern matching problems. s(ni'Oj) = NIL if is empty. II. Strings, Trees, and BiTrees It is easily seen that this function completely deter mines a tary tree and we write T = (N, nO'S). In this paper, both patterns and strings are finite length, fully specified sequences of symbols over a If n' = S(n,a), we say that nand n' are connected f finite alphabet = a ,a ,... ,a . Such a pattern of by a bpanah from n to n which has a label of o. wet l 2 t call n' a son of n, and n the father of n'. The degree length mwill be denoted as of a node n is the number of sons of that node, that is, the number of distinct a for which S(n,a) NIL. A node P = P(1) P(2) ... P(m), of degree 0 is a leaf of the tree. th It is useful to extend the domain of S from Nx where P(i), an element of , is the i symbol in the to (N U NIL) x (and extend the range to include th sequence, and is said to be located in the i position. nO) by the inductive definition To represent the substring of characters which begins at position i of P and ends at position j, we write (Sl) S(NIL,w) NIL for all w E P(i:j). That is, when i j, P(i:j) = P(i) ... P(j ), (S2) S(n,A) = n for all nEN and P(i:j) = A, the null string, for i j. Let denote the set of all finite length strings (S3) S(n,u.xJ) = S(S(n,w),a) for all n EN, w E L, over . strings WI and w in may be combined by and a E L:. 2 the operation of concatenation to form a new string Not every S: Nx (Nn ) U NIL is the successor W = WI w . The reverse of a string P = A(1) ... A(m) O 2 function of a tary tree. But a necessary and suffi is the string pr = A(m) ... A (1 ). cient condition for S to be a successor function of The length of a string or pattern, denoted by 19(w) some (unique, if it exists) tary tree can be expressed for W E , is the number of symbols in the sequence. in terms of the extended S. Namely, that there exists For example, 19(P(i:j» = ji+l if i j and is 0 if exactly one choice of w such that S(nO'w n for every i j. Informally, a bitree over can be thought of as n E N. there exists a T such that T = (N,nO'S), two related tary trees sharing a common node set. we say that S is We may also associate with T a father function F: N N defined by F(n ) = nO and for n' E Nn ' This work was partially supported by grants from O O the Alfred P. Sloan Foundation and the Exxon Education Foundation. P. Weiner was at Yale University when this F(n') = n ¢) S(n ,a) = n' for sorne a E . work was done.Suffix Arrays: practice Applications. Bioinformatics, information retrieval, data compression, … Many ingenious algorithms. Memory footprint very important. Stateofthe art still changing. year algorithm worst case memory 1990 ManberMyers N log N 8 N 1999 LarssonSadakane N log N 8 N 2003 KärkkäinenSanders N 13 N 2003 KoAluru N 10 N 2008 divsufsort2 N log N 5 N good choices (Yuta Mori) 2010 sais N 6 N 65String sorting summary We can develop lineartime sorts. Key compares not necessary for string keys. Use characters as index in an array. We can develop sublineartime sorts. Input size is amount of data in keys (not number of keys). Not all of the data has to be examined. 3way string quicksort is asymptotically optimal. 1.39 N lg N chars for random data. Long strings are rarely random in practice. Goal is often to learn the structure May need specialized algorithms. 66
Website URL
Comment
sharer
Presentations
Free
Document Information
Category:
Presentations
User Name:
Dr.BenjaminClark
User Type:
Teacher
Country:
United States
Uploaded Date:
21-07-2017