Question? Leave a message!




Double hashing

Double hashing
ECE 250 Algorithms and Data Structures Double hashing Douglas Wilhelm Harder, M.Math. LEL Department of Electrical and Computer Engineering University of Waterloo Waterloo, Ontario, Canada ece.uwaterloo.ca dwharderalumni.uwaterloo.ca © 20062013 by Douglas Wilhelm Harder. Some rights reserved.Double hashing 2 Outline This topic covers double hashing – More complex than linear or quadratic probing – Uses two hash functions • The first gives the bin • The second gives the jump size – Primary clustering no longer occurs – More efficient than linear or quadratic probingDouble hashing 3 Background Linear probing: – Look at bins k, k + 1, k + 2, k + 3, k + 4, … – Primary clustering Quadratic probing: – Look at bins k, k + 1, k + 4 , k + 9, k + 16, … – Secondary clustering (dangerous for poor hash functions) – Expensive: • Primesized arrays • Euclidean algorithm for calculating remaindersDouble hashing 4 Background Linear probing causes primary clustering – All entries follow the same search pattern for bins: int initial = hashM( x.hash(), M ); for ( int k = 0; k M; ++k ) bin = (initial + k) M; // ... Double hashing 5 Background This can be partially solved with quadratic probing – The step size grows in quadratic increments: 0, 1, 4, 9, 16, ... int bin = hashM( x.hash(), M ); for ( int k = 0; k M; ++k ) bin = (bin + k) M; // ... – Problems: what happens if multiple things are placed in the same bin • The same steps are followed for each: secondary clustering • This indicates a potentially poor hash function • This sometimes cannot be avoidedDouble hashing 6 Background An alternate solution – Give each object (with high probability) a different jump sizeDouble hashing 7 Description Associate with each object an initial bin and a jump size For example, int initial = hashM( x.hash(), M ); int jump = hash2M( x.hash(), M ); for ( int k = 0; k M; ++k ) bin = (initial + kjump) M; Double hashing 8 Description Problem: – Will initial + kjump step through all of the bins – Here, the jump size is 7: M = 16; initial = 5 jump = 7; for ( int k = 0; k = M; ++k ) std::cout (initial + kjump) M ' '; – The output is 5 12 3 10 1 8 15 6 13 4 11 2 9 0 7 14 5Double hashing 9 Description Problem: – Will initial + kjump step through all of the bins – Now, the jump size is 12: M = 16; initial = 5 jump = 12; for ( int k = 0; k = M; ++k ) std::cout (initial + kjump) M ' '; – The output is now 5 1 13 9 5 1 13 9 5 1 13 9 5 1 13 9 5Double hashing 10 Description 1 The jump size and the number of bins M must be coprime – They must share no common factors There are two ways of ensuring this: – Make M a prime number – Restrict the prime factors 1 also known as relatively primeDouble hashing 11 Making M Prime If we make the table size M = p a prime number then all values between 1, ..., p – 1 are relatively prime – Example: 13 does not share any common factors with 1, 2, 3, ..., 11, 12 Problems: – All operations must be done using • Cannot use , , or • The modulus operator is relatively slow – Doubling the number of bins is difficult: • What is the next prime after 2 × 263Double hashing 12 m Using M = 2 We can restrict the number of prime factors m If we ensure M = 2 then: – The only prime factor of M is 2 – All odd numbers are relatively prime to M Benefits: – Doubling the size of the array is easy – We can use bitwise operationsDouble hashing 13 m Using M = 2 Given a number, how do we ensure the jump size is odd Make the leastsignificant bit a 1: unsigned int makeodd( unsigned int n ) return n 1; For example: 0010101101100100111100010101010 00000000000000000000000000000001 = 00101011011001001111000101010101Double hashing 14 m Using M = 2 This also works for signed integers and 2’s complement: – The least significant bit of a negative odd number is 1 For example –1 11111111111111111111111111111111 –2 11111111111111111111111111111110 –3 11111111111111111111111111111101 –4 11111111111111111111111111111100 –5 11111111111111111111111111111011 –6 11111111111111111111111111111010 –7 11111111111111111111111111111001 –8 11111111111111111111111111111000 –9 11111111111111111111111111110111Double hashing 15 m Using M = 2 Devising a second hash function is necessary One solution: define two functions hashM1 and hashM2 which map onto 0, …, M – 1 – Use a different set of m bits, e.g., if m = 10 use 00101001011011011111000100010101 . 1 initial jump where initial == 365 and jump == 965 Useful only if m ≤ 16Double hashing 16 m Using M = 2 Another solution: int initial = hashM( x.hash(), M ); int jump = hashM( x.hash() 532934959, M ) 1; for ( int k = 0; k M; ++k ) bin = (initial + kjump) (M 1); // ... 532934959 is primeDouble hashing 17 Example Consider a hash table with M = 16 bins Given a 3digit hexadecimal number: – The leastsignificant digit is the primary hash function (bin) – The next digit is the secondary hash function (jump size) – Example: for 6B72A , the initial bin is A and the jump size is 3 16 Double hashing 18 Example Insert these numbers into this initially empty hash table 19A, 207, 3AD, 488, 5BA, 680, 74C, 826, 946, ACD, B32, C8B, DBE, E9C 0 1 2 3 4 5 6 7 8 9 A B C D E FDouble hashing 19 Example Start with the first four values: 19A, 207, 3AD, 488 0 1 2 3 4 5 6 7 8 9 A B C D E FDouble hashing 20 Example Start with the first four values: 19A, 207, 3AD, 488 0 1 2 3 4 5 6 7 8 9 A B C D E F 207 488 19A 3ADDouble hashing 21 Example Next we must insert 5BA 0 1 2 3 4 5 6 7 8 9 A B C D E F 207 488 19A 3ADDouble hashing 22 Example Next we must insert 5BA – Bin A is occupied – The jump size is B is already odd – A jump size of B is equal to a jump size of B – 10 = –5 16 0 1 2 3 4 5 6 7 8 9 A B C D E F 207 488 19A 3ADDouble hashing 23 Example Next we must insert 5BA – Bin A is occupied – The jump size is B is already odd – A jump size of B is equal to a jump size of B – 10 = –5 16 0 1 2 3 4 5 6 7 8 9 A B C D E F 5BA 207 488 19A 3AD – The sequence of bins is A, 5Double hashing 24 Example Next we are adding 680, 74C, 826 0 1 2 3 4 5 6 7 8 9 A B C D E F 5BA 207 488 19A 3ADDouble hashing 25 Example Next we are adding 680, 74C, 826 – All the bins are empty—simply insert them 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 5BA 826 207 488 19A 74C 3ADDouble hashing 26 Example Next, we must insert 946 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 5BA 826 207 488 19A 74C 3ADDouble hashing 27 Example Next, we must insert 946 – Bin 6 is occupied – The second digit is 4, which is even – The jump size is 4 + 1 = 5 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 5BA 826 207 488 19A 74C 3ADDouble hashing 28 Example Next, we must insert 946 – Bin 6 is occupied – The second digit is 4, which is even – The jump size is 4 + 1 = 5 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 5BA 826 207 488 19A 946 74C 3AD – The sequence of bins is 6, BDouble hashing 29 Example Next, we must insert ACD 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 5BA 826 207 488 19A 946 74C 3ADDouble hashing 30 Example Next, we must insert ACD – Bin D is occupied – The jump size is C is even, so C + 1 = D is odd – A jump size of D is equal to a jump size of D – 10 = –3 16 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 5BA 826 207 488 19A 946 74C 3ADDouble hashing 31 Example Next, we must insert ACD – Bin D is occupied – The jump size is C is even, so C + 1 = D is odd – A jump size of D is equal to a jump size of D – 10 = –3 16 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 ACD 5BA 826 207 488 19A 946 74C 3AD – The sequence of bins is D, A, 7, 4Double hashing 32 Example Next, we insert B32 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 ACD 5BA 826 207 488 19A 946 74C 3ADDouble hashing 33 Example Next, we insert B32 – Bin 2 is unoccupied 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 B32 ACD 5BA 826 207 488 19A 946 74C 3ADDouble hashing 34 Example Next, we insert C8B 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 B32 ACD 5BA 826 207 488 19A 946 74C 3ADDouble hashing 35 Example Next, we insert C8B – Bin B is occupied – The jump size is 8 is even, so 8 + 1 = 9 is odd – A jump size of 9 is equal to a jump size of 9 – 10 = –7 16 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 B32 ACD 5BA 826 207 488 19A 946 74C 3ADDouble hashing 36 Example Next, we insert C8B – Bin B is occupied – The jump size is 8 is even, so 8 + 1 = 9 is odd – A jump size of 9 is equal to a jump size of 9 – 10 = –7 16 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 B32 ACD 5BA 826 207 488 19A 946 74C 3AD C8B – The sequence of bins is B, 4, D, 6, FDouble hashing 37 Example Inserting D59, we note that bin 9 is unoccupied 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 B32 ACD 5BA 826 207 488 D59 19A 946 74C 3AD C8BDouble hashing 38 Example Finally, insert E9C 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 B32 ACD 5BA 826 207 488 D59 19A 946 74C 3AD C8BDouble hashing 39 Example Finally, insert E9C – Bin C is occupied – The jump size is 9 is odd – A jump size of 9 is equal to a jump size of 9 – 10 = –7 16 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 B32 ACD 5BA 826 207 488 D59 19A 946 74C 3AD C8BDouble hashing 40 Example Finally, insert E9C – Bin C is occupied – The jump size is 9 is odd – A jump size of 9 is equal to a jump size of 9 – 10 = –7 16 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 B32 ACD 5BA 826 207 488 D59 19A 946 74C 3AD E9C C8B – The sequence of bins is C, 5, EDouble hashing 41 Example Having completed these insertions: – The load factor is l = 14/16 = 0.875 – The average number of probes is 25/14 ≈ 1.79 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 B32 ACD 5BA 826 207 488 D59 19A 946 74C 3AD E9C C8BDouble hashing 42 Resizing the array To double the capacity of the array, each value must be rehashed – 680, B32, ACD, 5BA, 826, 207, 488, D59 may be immediately placed • We use the leastsignificant five bits for the initial bin 0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 680 826 207 488 ACD B32 D59 5BADouble hashing 43 Resizing the array To double the capacity of the array, each value must be rehashed – 19A resulted in a collision – The jump size is now 000110011010 or C + 1 = D = 13 2 10 • We are using the next five bits for the jump size 0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 680 826 207 488 ACD B32 19A D59 5BA – The sequence of bins: 1A, 7, 14Double hashing 44 Resizing the array To double the capacity of the array, each value must be rehashed – 946 resulted in a collision – The jump size is now 100101000110 or A + 1 = B = 11 2 10 0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 680 826 207 488 ACD 946 B32 19A D59 5BA – The sequence of bins: 6, 11Double hashing 45 Resizing the array To double the capacity of the array, each value must be rehashed – 74C fits into its bin 0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 680 826 207 488 74C ACD 946 B32 19A D59 5BADouble hashing 46 Resizing the array To double the capacity of the array, each value must be rehashed – 3AD resulted in a collision – The jump size is now 001110101101 or 1D = 29 ≡ –3 2 10 10 32 0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 680 826 207 488 3AD 74C ACD 946 B32 19A D59 5BA – The sequence of bins: D, ADouble hashing 47 Resizing the array To double the capacity of the array, each value must be rehashed – Both E9C and C8B fit without a collision – The load factor is l = 14/32 = 0.4375 – The average number of probes is 18/14 ≈ 1.29 0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 680 826 207 488 3AD C8B 74C ACD 946 B32 19A D59 5BA E9CDouble hashing 48 Erase Can we remove an object like we did with linear probing – Clearly, no, as there are M/2 possible locations where an object which could have occupied a position could be located As with quadratic probing, we will use lazy deletion – Mark a bin as ERASED; however, when searching, treat the bin as occupied and continue enum binstatet UNOCCUPIED, OCCUPIED, binstatet stateM; ERASED for ( int i = 0; i M; ++i ) ; statei = UNOCCUPIED; Double hashing 49 Erase If we erase 3AD, we must mark that bin as erased 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 B32 ACD 5BA 826 207 488 959 19A 946 74C 3AD E9C C8BDouble hashing 50 Find When searching, it is necessary to skip over this bin – For example, find ACD: D, A, 7, 4 find C8B: B, 4, D, 6, F 0 1 2 3 4 5 6 7 8 9 A B C D E F 680 B32 ACD 5BA 826 207 488 959 19A 946 74C 3AD E9C C8BDouble hashing 51 Multiple insertions and erases One problem which may occur after multiple insertions and removals is that numerous bins may be marked as ERASED – In calculating the load factor, an ERASED bin is equivalent to an OCCUPIED bin This will increase our run times…Double hashing 52 Multiple insertions and erases We can easily track the number of bins which are: – UNOCCUPIED – OCCUPIED – ERASED by updating appropriate counters If the load factor l grows too large, we have two choices: – If the load factor due to occupied bins is too large, double the table size – Otherwise, rehash all of the objects currently in the hash tableDouble hashing 53 Expected number of probes As with quadratic probing, the number of probes is significantly lower than for linear probing: 1  ln  – Successful searches: 1l  l 1 – Unsuccessful searches: 1l When l = 2/3, we requires 1.65 and 3 probes, respectively – Linear probing required 3 and 5 probes, respectively Load Factor (l) Unsuccessful search Successful search Reference: Knuth, The Art of Computer Programming, nd Vol. 3, 2 Ed., 1998, Addison Wesley, p. 530.Double hashing 54 Double hashing versus linear probing Comparing the two: Linear probing Unsuccessful search Successful search Double hashing Unsuccessful search Successful search Load Factor (l) Examined BinsDouble hashing 55 Cache misses Double hashing solves the secondary clustering problem – Unfortunately, each subsequent probe could be anywhere in the array – For large arrays, it is unlikely that block is in the cache • This will flag a cache miss and another page will be copied to the cache – This is slower than quadratic probing – It may also remove another page from the cache that is to be called again soon in the future • If a change was made to the page in the cache, it must be copied out – When at all possible, use quadratic probingDouble hashing 56 Summary In this topic, we have looked at double hashing: – An open addressing technique – Uses two hash functions: • The first indicates the bin • The second gives the jump size – Insertions and searching are straight forward – Removing objects is more complicated: use lazy deletion – It may be useful, on occasion, to clean the table – Much worse than quadratic probing with respect to cache missesDouble hashing 57 References Wikipedia, http://en.wikipedia.org/wiki/Hashfunction 1 Cormen, Leiserson, and Rivest, Introduction to Algorithms, McGraw Hill, 1990. rd 2 Weiss, Data Structures and Algorithm Analysis in C++, 3 Ed., Addison Wesley. These slides are provided for the ECE 250 Algorithms and Data Structures course. The material in it reflects Douglas W. Harder’s best judgment in light of the information available to him at the time of preparation. Any reliance on these course slides by any party for any other purpose are the responsibility of such parties. Douglas W. Harder accepts no responsibility for damages, if any, suffered by any party as a result of decisions made or actions based on these course slides for any other purpose than that for which it was intended.
sharer
Presentations
Free
Document Information
Category:
Presentations
User Name:
Dr.SethPatton
User Type:
Teacher
Country:
United Kingdom
Uploaded Date:
22-07-2017