Question? Leave a message!




HASH TABLES

HASH TABLES
Dr.BenjaminClark Profile Pic
Dr.BenjaminClark,United States,Teacher
Published Date:21-07-2017
Website URL
Comment
ROBERT SEDGEWICK KEVIN WAYNE Algorithms 3.4 HASH TABLES hash functions ‣ separate chaining ‣ linear probing ‣ Algorithms FOUR TH EDITION context ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduSymbol table implementations: summary guarantee guarantee guarantee average case average case average case or order dered ed key key implementation implementation ops? ops? interface interface search insert delete search hit insert delete sequential search equals() N N N ½ N N ½ N (unordered list) binary search compareTo() ✔ lg N N N lg N ½ N ½ N (ordered array) compareTo() BST ✔ N N N 1.39 lg N 1.39 lg N √ N compareTo() red-black BST ✔ 2 lg N 2 lg N 2 lg N 1.0 lg N 1.0 lg N 1.0 lg N Q. Can we do better? A. Yes, but with different access to the data. 2Hashing: basic plan Save items in a key-indexed table (index is a function of the key). 0 Hash function. Method for computing array index from key. 1 2 hash("it") = 3 3 "it" ?? 4 Issues. hash("times") = 3 5 Computing the hash function. Equality test: Method for checking whether two keys are equal. Collision resolution: Algorithm and data structure to handle two keys that hash to the same array index. Classic space-time tradeoff. No space limitation: trivial hash function with key as index. No time limitation: trivial collision resolution with sequential search. Space and time limitations: hashing (the real world). 33.4 HASH TABLES hash functions ‣ separate chaining ‣ linear probing ‣ Algorithms context ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduComputing the hash function Idealistic goal. Scramble the keys uniformly to produce a table index. Efficiently computable. key Each table index equally likely for each key. thoroughly researched problem, still problematic in practical applications Ex 1. Phone numbers. Bad: first three digits. table Better: last three digits. index Ex 2. Social Security numbers. 573 = California, 574 = Alaska Bad: first three digits. (assigned in chronological order within geographic region) Better: last three digits. Practical challenge. Need different approach for each key type. 5Java’s hash code conventions All Java classes inherit a method hashCode(), which returns a 32-bit int. Requirement. If x.equals(y), then (x.hashCode() == y.hashCode()). Highly desirable. If x.equals(y), then (x.hashCode() = y.hashCode()). x y y.hashCode() x.hashCode() Default implementation. Memory address of x. Legal (but poor) implementation. Always return 17. Customized implementations. Integer, Double, String, File, URL, Date, … User-defined types. Users are on their own. 6Implementing hash code: integers, booleans, and doubles Java library implementations public final class Integer public final class Double private final int value; private final double value; ... ... public int hashCode() public int hashCode() return value; long bits = doubleToLongBits(value); return (int) (bits (bits 32)); public final class Boolean private final boolean value; convert to IEEE 64-bit representation; ... xor most significant 32-bits with least significant 32-bits public int hashCode() Warning: -0.0 and +0.0 have different hash codes if (value) return 1231; else return 1237; 7Implementing hash code: strings Java library implementation char Unicode public final class String … … private final char s; 'a' 97 ... 'b' 98 public int hashCode() 'c' 99 int hash = 0; … ... for (int i = 0; i length(); i++) hash = si + (31 hash); return hash; th i character of s Horner's method to hash string of length L: L multiplies/adds. L–1 2 1 0 Equivalent to h = s0 · 31 + … + sL – 3 · 31 + sL – 2 · 31 + sL – 1 · 31 . String s = "call"; Ex. 3 2 1 0 3045982 = 99·31 + 97·31 + 108·31 + 108·31 int code = s.hashCode(); = 108 + 31· (108 + 31 · (97 + 31 · (99))) (Horner's method) 8Implementing hash code: strings Performance optimization. Cache the hash value in an instance variable. Return cached value. public final class String cache of hash code private int hash = 0; private final char s; ... public int hashCode() int h = hash; return cached value if (h = 0) return h; for (int i = 0; i length(); i++) h = si + (31 h); store cache of hash code hash = h; return h; Q. What if hashCode() of string is 0? 9Implementing hash code: user-defined types public final class Transaction implements ComparableTransaction private final String who; private final Date when; private final double amount; public Transaction(String who, Date when, double amount) / as before / ... public boolean equals(Object y) / as before / public int hashCode() nonzero constant for reference types, int hash = 17; use hashCode() hash = 31hash + who.hashCode(); hash = 31hash + when.hashCode(); for primitive types, hash = 31hash + ((Double) amount).hashCode(); use hashCode() return hash; of wrapper type typically a small prime 10Hash code design "Standard" recipe for user-defined types. Combine each significant field using the 31x + y rule. If field is a primitive type, use wrapper type hashCode(). If field is null, return 0. applies rule recursively If field is a reference type, use hashCode(). If field is an array, apply to each entry. or use Arrays.deepHashCode() In practice. Recipe works reasonably well; used in Java libraries. In theory. Keys are bitstring; "universal" hash functions exist. Basic rule. Need to use the whole key to compute hash code; consult an expert for state-of-the-art hash codes. 11Modular hashing 31 31 Hash code. An int between -2 and 2 - 1. Hash function. An int between 0 and M - 1 (for use as array index). typically a prime or power of 2 x private int hash(Key key) return key.hashCode() % M; bug x.hashCode() private int hash(Key key) return Math.abs(key.hashCode()) % M; 1-in-a-billion bug hash(x) 31 hashCode() of "polygenelubricants" is -2 private int hash(Key key) return (key.hashCode() & 0x7fffffff) % M; correct 12Uniform hashing assumption Uniform hashing assumption. Each key is equally likely to hash to an integer between 0 and M - 1. Bins and balls. Throw balls uniformly at random into M bins. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Birthday problem. Expect two balls in the same bin after π M / 2 tosses. Coupon collector. Expect every bin has ≥ 1 ball after M ln M tosses. Load balancing. After M tosses, expect most loaded bin has Θ ( log M / log log M ) balls. 13Uniform hashing assumption Uniform hashing assumption. Each key is equally likely to hash to an integer between 0 and M - 1. Bins and balls. Throw balls uniformly at random into M bins. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Hash value frequencies for words in Tale of Two Cities (M = 97) Java's String data uniformly distribute the keys of Tale of Two Cities 143.4 HASH TABLES hash functions ‣ separate chaining ‣ linear probing ‣ Algorithms context ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduCollisions Collision. Two distinct keys hashing to same index. Birthday problem ⇒ can't avoid collisions unless you have a ridiculous (quadratic) amount of memory. Coupon collector + load balancing ⇒ collisions are evenly distributed. 0 1 hash("it") = 3 2 3 "it" ?? 4 hash("times") = 3 5 Challenge. Deal with collisions efficiently. 16Separate-chaining symbol table Use an array of M N linked lists. H. P. Luhn, IBM 1953 Hash: map key to integer i between 0 and M - 1. th Insert: put at front of i chain (if not already there). th Search: need to search only i chain. key value hash S 2 0 E 0 1 A 8 E 12 A 0 2 R 4 3 st null 0 C 4 4 1 H 4 5 2 X 7 S 0 E 0 6 3 X 2 7 4 L 11 P 10 A 0 8 M 4 9 P 3 10 M 9 H 5 C 4 R 3 L 3 11 E 0 12 17 Hashing with separate chaining for standard indexing client Separate-chaining symbol table: Java implementation public class SeparateChainingHashSTKey, Value array doubling and private int M = 97; // number of chains halving code omitted private Node st = new NodeM; // array of chains private static class Node private Object key; no generic array creation (declare key and value of type Object) private Object val; private Node next; ... private int hash(Key key) return (key.hashCode() & 0x7fffffff) % M; public Value get(Key key) int i = hash(key); for (Node x = sti; x = null; x = x.next) if (key.equals(x.key)) return (Value) x.val; return null; 18Separate-chaining symbol table: Java implementation public class SeparateChainingHashSTKey, Value private int M = 97; // number of chains private Node st = new NodeM; // array of chains private static class Node private Object key; private Object val; private Node next; ... private int hash(Key key) return (key.hashCode() & 0x7fffffff) % M; public void put(Key key, Value val) int i = hash(key); for (Node x = sti; x = null; x = x.next) if (key.equals(x.key)) x.val = val; return; sti = new Node(key, val, sti); 19Analysis of separate chaining Proposition. Under uniform hashing assumption, prob. that the number of keys in a list is within a constant factor of N / M is extremely close to 1. Pf sketch. Distribution of list size obeys a binomial distribution. (10, .12511...) .125 0 30 0 10 20 4 3 Binomial distribution (N = 10 , M = 10 , = 10) equals() and hashCode() Consequence. Number of probes for search/insert is proportional to N / M. M too large ⇒ too many empty chains. M times faster than M too small ⇒ chains too long. sequential search Typical choice: M N / 4 ⇒ constant-time ops. 20