Question? Leave a message!




HASH TABLES

HASH TABLES
ROBERT SEDGEWICK KEVIN WAYNE Algorithms 3.4 HASH TABLES hash functions ‣ separate chaining ‣ linear probing ‣ Algorithms FOUR TH EDITION context ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduSymbol table implementations: summary guarantee guarantee guarantee average case average case average case or order dered ed key key implementation implementation ops ops interface interface search insert delete search hit insert delete sequential search equals() N N N ½ N N ½ N (unordered list) binary search compareTo() ✔ lg N N N lg N ½ N ½ N (ordered array) compareTo() BST ✔ N N N 1.39 lg N 1.39 lg N √ N compareTo() redblack BST ✔ 2 lg N 2 lg N 2 lg N 1.0 lg N 1.0 lg N 1.0 lg N Q. Can we do better A. Yes, but with different access to the data. 2Hashing: basic plan Save items in a keyindexed table (index is a function of the key). 0 Hash function. Method for computing array index from key. 1 2 hash("it") = 3 3 "it" 4 Issues. hash("times") = 3 5 Computing the hash function. Equality test: Method for checking whether two keys are equal. Collision resolution: Algorithm and data structure to handle two keys that hash to the same array index. Classic spacetime tradeoff. No space limitation: trivial hash function with key as index. No time limitation: trivial collision resolution with sequential search. Space and time limitations: hashing (the real world). 33.4 HASH TABLES hash functions ‣ separate chaining ‣ linear probing ‣ Algorithms context ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduComputing the hash function Idealistic goal. Scramble the keys uniformly to produce a table index. Efficiently computable. key Each table index equally likely for each key. thoroughly researched problem, still problematic in practical applications Ex 1. Phone numbers. Bad: first three digits. table Better: last three digits. index Ex 2. Social Security numbers. 573 = California, 574 = Alaska Bad: first three digits. (assigned in chronological order within geographic region) Better: last three digits. Practical challenge. Need different approach for each key type. 5Java’s hash code conventions All Java classes inherit a method hashCode(), which returns a 32bit int. Requirement. If x.equals(y), then (x.hashCode() == y.hashCode()). Highly desirable. If x.equals(y), then (x.hashCode() = y.hashCode()). x y y.hashCode() x.hashCode() Default implementation. Memory address of x. Legal (but poor) implementation. Always return 17. Customized implementations. Integer, Double, String, File, URL, Date, … Userdefined types. Users are on their own. 6Implementing hash code: integers, booleans, and doubles Java library implementations public final class Integer public final class Double private final int value; private final double value; ... ... public int hashCode() public int hashCode() return value; long bits = doubleToLongBits(value); return (int) (bits (bits 32)); public final class Boolean private final boolean value; convert to IEEE 64bit representation; ... xor most significant 32bits with least significant 32bits public int hashCode() Warning: 0.0 and +0.0 have different hash codes if (value) return 1231; else return 1237; 7Implementing hash code: strings Java library implementation char Unicode public final class String … … private final char s; 'a' 97 ... 'b' 98 public int hashCode() 'c' 99 int hash = 0; … ... for (int i = 0; i length(); i++) hash = si + (31 hash); return hash; th i character of s Horner's method to hash string of length L: L multiplies/adds. L–1 2 1 0 Equivalent to h = s0 · 31 + … + sL – 3 · 31 + sL – 2 · 31 + sL – 1 · 31 . String s = "call"; Ex. 3 2 1 0 3045982 = 99·31 + 97·31 + 108·31 + 108·31 int code = s.hashCode(); = 108 + 31· (108 + 31 · (97 + 31 · (99))) (Horner's method) 8Implementing hash code: strings Performance optimization. Cache the hash value in an instance variable. Return cached value. public final class String cache of hash code private int hash = 0; private final char s; ... public int hashCode() int h = hash; return cached value if (h = 0) return h; for (int i = 0; i length(); i++) h = si + (31 h); store cache of hash code hash = h; return h; Q. What if hashCode() of string is 0 9Implementing hash code: userdefined types public final class Transaction implements ComparableTransaction private final String who; private final Date when; private final double amount; public Transaction(String who, Date when, double amount) / as before / ... public boolean equals(Object y) / as before / public int hashCode() nonzero constant for reference types, int hash = 17; use hashCode() hash = 31hash + who.hashCode(); hash = 31hash + when.hashCode(); for primitive types, hash = 31hash + ((Double) amount).hashCode(); use hashCode() return hash; of wrapper type typically a small prime 10Hash code design "Standard" recipe for userdefined types. Combine each significant field using the 31x + y rule. If field is a primitive type, use wrapper type hashCode(). If field is null, return 0. applies rule recursively If field is a reference type, use hashCode(). If field is an array, apply to each entry. or use Arrays.deepHashCode() In practice. Recipe works reasonably well; used in Java libraries. In theory. Keys are bitstring; "universal" hash functions exist. Basic rule. Need to use the whole key to compute hash code; consult an expert for stateoftheart hash codes. 11Modular hashing 31 31 Hash code. An int between 2 and 2 1. Hash function. An int between 0 and M 1 (for use as array index). typically a prime or power of 2 x private int hash(Key key) return key.hashCode() M; bug x.hashCode() private int hash(Key key) return Math.abs(key.hashCode()) M; 1inabillion bug hash(x) 31 hashCode() of "polygenelubricants" is 2 private int hash(Key key) return (key.hashCode() 0x7fffffff) M; correct 12Uniform hashing assumption Uniform hashing assumption. Each key is equally likely to hash to an integer between 0 and M 1. Bins and balls. Throw balls uniformly at random into M bins. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Birthday problem. Expect two balls in the same bin after π M / 2 tosses. Coupon collector. Expect every bin has ≥ 1 ball after M ln M tosses. Load balancing. After M tosses, expect most loaded bin has Θ ( log M / log log M ) balls. 13Uniform hashing assumption Uniform hashing assumption. Each key is equally likely to hash to an integer between 0 and M 1. Bins and balls. Throw balls uniformly at random into M bins. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Hash value frequencies for words in Tale of Two Cities (M = 97) Java's String data uniformly distribute the keys of Tale of Two Cities 143.4 HASH TABLES hash functions ‣ separate chaining ‣ linear probing ‣ Algorithms context ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduCollisions Collision. Two distinct keys hashing to same index. Birthday problem ⇒ can't avoid collisions unless you have a ridiculous (quadratic) amount of memory. Coupon collector + load balancing ⇒ collisions are evenly distributed. 0 1 hash("it") = 3 2 3 "it" 4 hash("times") = 3 5 Challenge. Deal with collisions efficiently. 16Separatechaining symbol table Use an array of M N linked lists. H. P. Luhn, IBM 1953 Hash: map key to integer i between 0 and M 1. th Insert: put at front of i chain (if not already there). th Search: need to search only i chain. key value hash S 2 0 E 0 1 A 8 E 12 A 0 2 R 4 3 st null 0 C 4 4 1 H 4 5 2 X 7 S 0 E 0 6 3 X 2 7 4 L 11 P 10 A 0 8 M 4 9 P 3 10 M 9 H 5 C 4 R 3 L 3 11 E 0 12 17 Hashing with separate chaining for standard indexing client Separatechaining symbol table: Java implementation public class SeparateChainingHashSTKey, Value array doubling and private int M = 97; // number of chains halving code omitted private Node st = new NodeM; // array of chains private static class Node private Object key; no generic array creation (declare key and value of type Object) private Object val; private Node next; ... private int hash(Key key) return (key.hashCode() 0x7fffffff) M; public Value get(Key key) int i = hash(key); for (Node x = sti; x = null; x = x.next) if (key.equals(x.key)) return (Value) x.val; return null; 18Separatechaining symbol table: Java implementation public class SeparateChainingHashSTKey, Value private int M = 97; // number of chains private Node st = new NodeM; // array of chains private static class Node private Object key; private Object val; private Node next; ... private int hash(Key key) return (key.hashCode() 0x7fffffff) M; public void put(Key key, Value val) int i = hash(key); for (Node x = sti; x = null; x = x.next) if (key.equals(x.key)) x.val = val; return; sti = new Node(key, val, sti); 19Analysis of separate chaining Proposition. Under uniform hashing assumption, prob. that the number of keys in a list is within a constant factor of N / M is extremely close to 1. Pf sketch. Distribution of list size obeys a binomial distribution. (10, .12511...) .125 0 30 0 10 20 4 3 Binomial distribution (N = 10 , M = 10 , = 10) equals() and hashCode() Consequence. Number of probes for search/insert is proportional to N / M. M too large ⇒ too many empty chains. M times faster than M too small ⇒ chains too long. sequential search Typical choice: M N / 4 ⇒ constanttime ops. 20Resizing in a separatechaining hash table Goal. Average length of list N / M = constant. Double size of array M when N / M ≥ 8. Halve size of array M when N / M ≤ 2. x.hashCode() does not change Need to rehash all keys when resizing. but hash(x) can change before resizing st A B C D E F G H I J 0 1 K L M N O P after resizing K I st 0 P N L E A 1 2 J F C B 3 O M H G D 21Deletion in a separatechaining hash table Q. How to delete a key (and its associated value) A. Easy: need only consider chain containing key. before deleting C after deleting C K I K I st st 0 0 P N L P N L 1 1 2 2 J F C B J F B 3 3 O M O M 22Symbol table implementations: summary guarantee guarantee guarantee average case average case average case or order dered ed key key implementation implementation ops ops interface interface search insert delete search hit insert delete sequential search equals() N N N ½ N N ½ N (unordered list) binary search compareTo() ✔ lg N N N lg N ½ N ½ N (ordered array) compareTo() BST ✔ N N N 1.39 lg N 1.39 lg N √ N compareTo() redblack BST ✔ 2 lg N 2 lg N 2 lg N 1.0 lg N 1.0 lg N 1.0 lg N equals() separate chaining N N N 35 35 35 hashCode() under uniform hashing assumption 233.4 HASH TABLES hash functions ‣ separate chaining ‣ linear probing ‣ Algorithms context ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduCollision resolution: open addressing Open addressing. AmdahlBoehmeRochersterSamuel, IBM 1953 When a new key collides, find next empty slot, and put it there. st0 jocularly null st1 listen st2 st3 suburban null st30000 browsing linear probing (M = 30001, N = 15000) 25Linearprobing hash table demo Hash. Map key to integer i between 0 and M1. Insert. Put at table index i if free; if not try i+1, i+2, etc. linearprobing hash table 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 st M = 16Linearprobing hash table demo Hash. Map key to integer i between 0 and M1. Search. Search table index i; if occupied but no match, try i+1, i+2, etc. search K hash(K) = 5 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 st P M A C S H L E R X K M = 16 search miss (return null)Linearprobing hash table summary Hash. Map key to integer i between 0 and M1. Insert. Put at table index i if free; if not try i+1, i+2, etc. Search. Search table index i; if occupied but no match, try i+1, i+2, etc. Note. Array size M must be greater than number of keyvalue pairs N. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 st P M A C S H L E R X M = 16 28Linearprobing symbol table: Java implementation public class LinearProbingHashSTKey, Value private int M = 30001; array doubling and private Value vals = (Value) new ObjectM; halving code omitted private Key keys = (Key) new ObjectM; private int hash(Key key) / as before / private void put(Key key, Value val) / next slide / public Value get(Key key) for (int i = hash(key); keysi = null; i = (i+1) M) if (key.equals(keysi)) return valsi; return null; 29Linearprobing symbol table: Java implementation public class LinearProbingHashSTKey, Value private int M = 30001; private Value vals = (Value) new ObjectM; private Key keys = (Key) new ObjectM; private int hash(Key key) / as before / private Value get(Key key) / previous slide / public void put(Key key, Value val) int i; for (i = hash(key); keysi = null; i = (i+1) M) if (keysi.equals(key)) break; keysi = key; valsi = val; 30Clustering Cluster. A contiguous block of items. Observation. New keys likely to hash into middle of big clusters. 31Knuth's parking problem Model. Cars arrive at oneway street with M parking spaces. Each desires a random space i : if space i is taken, try i + 1, i + 2, etc. Q. What is mean displacement of a car displacement = 3 Halffull. With M / 2 cars, mean displacement is 3 / 2. Full. With M cars, mean displacement is π M / 8 . 32Analysis of linear probing Proposition. Under uniform hashing assumption, the average of probes in a linear probing hash table of size M that contains N = α M keys is: ⇥ ⇥ 1 1 1 1 ⇥ 1+ ⇥ 1+ 2 2 1 2 (1 ) search hit search miss / insert Pf. Parameters. M too large ⇒ too many empty array entries. M too small ⇒ search time blows up. probes for search hit is about 3/2 Typical choice: α = N / M ½. probes for search miss is about 5/2 33Resizing in a linearprobing hash table Goal. Average length of list N / M ≤ ½. Double size of array M when N / M ≥ ½. Halve size of array M when N / M ≤ ⅛. Need to rehash all keys when resizing. before resizing 0 1 2 3 4 5 6 7 keys E S R A vals 1 0 3 2 after resizing 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 keys A S E R vals 2 0 1 3 34Deletion in a linearprobing hash table Q. How to delete a key (and its associated value) A. Requires some care: can't just delete array entries. before deleting S 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 P M A C S H L E R X keys 10 9 8 4 0 5 11 12 3 7 vals doesn't work, e.g., if hash(H) = 4 after deleting S 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 P M A C H L E R X keys 10 9 8 4 5 11 12 3 7 vals 35ST implementations: summary guarantee guarantee guarantee average case average case average case or order dered ed key key implementation implementation ops ops interface interface search insert delete search hit insert delete sequential search equals() N N N ½ N N ½ N (unordered list) binary search compareTo() ✔ lg N N N lg N ½ N ½ N (ordered array) compareTo() BST ✔ N N N 1.39 lg N 1.39 lg N √ N compareTo() redblack BST ✔ 2 lg N 2 lg N 2 lg N 1.0 lg N 1.0 lg N 1.0 lg N equals() separate chaining N N N 35 35 35 hashCode() equals() linear probing N N N 35 35 35 hashCode() under uniform hashing assumption 363.4 HASH TABLES hash functions ‣ separate chaining ‣ linear probing ‣ Algorithms context ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduWar story: algorithmic complexity attacks Q. Is the uniform hashing assumption important in practice A. Obvious situations: aircraft control, nuclear reactor, pacemaker. A. Surprising situations: denialofservice attacks. st 0 1 2 3 malicious adversary learns your hash function 4 (e.g., by reading Java API) and causes a big pileup 5 in single slot that grinds performance to a halt 6 7 Realworld exploits. CrosbyWallach 2003 Bro server: send carefully chosen packets to DOS the server, using less bandwidth than a dialup modem. Perl 5.8.0: insert carefully chosen strings into associative array. Linux 2.4.20 kernel: save files with carefully chosen names. 38Format For Printing XML Clone This Bug Last Comment Bug 750533 (CVE20122739) CVE20122739 java: hash table collisions CPU usage DoS (oCERT2011003) Status: ASSIGNED Reported: 20111101 10:13 EDT by Jan Lieskovsky Modified: 20121127 10:50 EST (History) Aliases: CVE20122739 (edit) CC List: 8 users (show) Product: Security Response See Also: Component: vulnerability (Show other bugs) Fixed In Version: Version(s): unspecified Platform: All Linux Doc Type: Bug Fix Doc Text: Priority: medium Severity: medium Clone Of: Target Milestone: Environment: Target Release: Last Closed: 20111229 07:40:08 Assigned To: Red Hat Security Response Team QA Contact: URL: Whiteboard: impact=moderate,public=20111228,repor... Keywords: Reopened, Security War story: algorithmic complexity attacks Depends On: Blocks: hashdos/oCERT2011003 750536 Show dependency tree / graph A Java bug report. Groups: None (edit) Attachments (Terms of Use) Add an attachment (proposed patch, testcase, etc.) Jan Lieskovsky 20111101 10:13:47 EDT Description Julian Wälde and Alexander Klink reported that the String.hashCode() hash function is not sufficiently collision resistant. hashCode() value is used in the implementations of HashMap and Hashtable classes: http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html http://docs.oracle.com/javase/6/docs/api/java/util/Hashtable.html A speciallycrafted set of keys could trigger hash function collisions, which can degrade performance of HashMap or Hashtable by changing hash table operations complexity from an expected/average O(1) to the worst case O(n). Reporters were able to find colliding strings efficiently using equivalent substrings and meet in the middle techniques. This problem can be used to start a denial of service attack against Java applications that use untrusted inputs as HashMap or Hashtable keys. An example of such application is web application server (such as tomcat, see bug 750521) that may fill hash tables with data from HTTP request (such as GET or POST parameters). A remote attack could use that to make JVM use excessive amount of CPU time by sending a POST request with large amount of parameters which hash to the same value. This problem is similar to the issue that was previously reported for and fixed in e.g. perl: http://www.cs.rice.edu/scrosby/hash/CrosbyWallachUsenixSec2003.pdf Jan Lieskovsky 20111101 10:18:44 EDT Comment 2 Acknowledgements: Red Hat would like to thank oCERT for reporting this issue. oCERT acknowledges Julian Wälde and Alexander Klink as the original reporters. 39 Tomas Hoger 20111229 07:23:27 EST Comment 11 This issue was presented on 28C3: http://events.ccc.de/congress/2011/Fahrplan/events/4680.en.html Details were posted to fulldisclosure: http://seclists.org/fulldisclosure/2011/Dec/477Algorithmic complexity attack on Java Goal. Find family of strings with the same hash code. Solution. The base31 hash code is part of Java's string API. key hashCode() key hashCode() key hashCode() "Aa" 2112 "AaAaAaAa" 540425984 "BBAaAaAa" 540425984 "BB" 2112 "AaAaAaBB" 540425984 "BBAaAaBB" 540425984 "AaAaBBAa" 540425984 "BBAaBBAa" 540425984 "AaAaBBBB" 540425984 "BBAaBBBB" 540425984 "AaBBAaAa" 540425984 "BBBBAaAa" 540425984 "AaBBAaBB" 540425984 "BBBBAaBB" 540425984 "AaBBBBAa" 540425984 "BBBBBBAa" 540425984 "AaBBBBBB" 540425984 "BBBBBBBB" 540425984 N 2 strings of length 2N that hash to same value 40Diversion: oneway hash functions Oneway hash function. "Hard" to find a key that will hash to a desired value (or two keys that hash to same value). Ex. MD4, MD5, SHA0, SHA1, SHA2, WHIRLPOOL, RIPEMD160, …. known to be insecure String password = args0; MessageDigest sha1 = MessageDigest.getInstance("SHA1"); byte bytes = sha1.digest(password); / prints bytes as hex string / Applications. Digital fingerprint, message digest, storing passwords. Caveat. Too expensive for use in ST implementations. 41Separate chaining vs. linear probing Separate chaining. Performance degrades gracefully. Clustering less sensitive to poorlydesigned hash function. key hash value Linear probing. S 2 0 E 0 1 A 8 E 12 Less wasted space. A 0 2 Better cache performance. R 4 3 st null 0 C 4 4 1 H 4 5 2 X 7 S 0 E 0 6 3 X 2 7 4 L 11 P 10 A 0 8 M 4 9 P 3 10 M 9 H 5 C 4 R 3 L 3 11 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 E 0 12 P M A C S H L E R X keys Hashing with separate chaining for standard indexing client 10 9 8 4 0 5 11 12 3 7 vals 42Hashing: variations on the theme Many improved versions have been studied. Twoprobe hashing. separatechaining variant Hash to two positions, insert key in shorter of the two chains. Reduces expected length of the longest chain to log log N. Double hashing. linearprobing variant Use linear probing, but skip a variable amount, not just 1 each time. Effectively eliminates clustering. Can allow table to become nearly full. More difficult to implement delete. Cuckoo hashing. linearprobing variant Hash key to two positions; insert key into either position; if occupied, reinsert displaced key into its alternative position (and recur). Constant worstcase time for search. 43Hash tables vs. balanced search trees Hash tables. Simpler to code. No effective alternative for unordered keys. Faster for simple keys (a few arithmetic ops versus log N compares). Better system support in Java for strings (e.g., cached hash code). Balanced search trees. Stronger performance guarantee. Support for ordered ST operations. Easier to implement compareTo() correctly than equals() and hashCode(). Java system includes both. Redblack BSTs: java.util.TreeMap, java.util.TreeSet. Hash tables: java.util.HashMap, java.util.IdentityHashMap. 44
sharer
Presentations
Free
Document Information
Category:
Presentations
User Name:
Dr.BenjaminClark
User Type:
Teacher
Country:
United States
Uploaded Date:
21-07-2017