Question? Leave a message!

Chained hash tables

Chained hash tables
ECE 250 Algorithms and Data Structures Chained hash Douglas Wilhelm Harder, M.Math. LEL Department of Electrical and Computer Engineering tables University of Waterloo Waterloo, Ontario, Canada © 20062013 by Douglas Wilhelm Harder. Some rights reserved.Chained hash tables 2 Outline We have: – Discussed techniques for hashing – Discussed mapping down to a given range 0, ..., M – 1 Now we must deal with collisions – Numerous techniques exist – Containers in general • Specifically linked listsChained hash tables 3 Background First, a review: – We want to store objects in an array of size M – We want to quickly calculate the bin where we want to store the object • We came up with hash functions—hopefully Q(1) • Perfect hash functions (no collisions) are difficult to design – We will look at some schemes for dealing with collisionsChained hash tables 4 Implementation Consider associating each bin with a linked list: template class Type class Chainedhashtable private: int tablecapacity; int tablesize; SinglelistType table; unsigned int hash( Type const ); public: Chainedhashtable( int = 16 ); int count( Type const ) const; void insert( Type const ); int erase( Type const ); // ... ;Chained hash tables 5 Implementation The constructor creates an array of n linked lists template class Type Chainedhashtable::Chainedhashtable( int n ): tablecapacity( std::max( n, 1 ) ), tablesize( 0 ), table( new SinglelistTypetablecapacity ) // empty constructor Chained hash tables 6 Implementation The function hash will determine the bin of an object: template class Type int Chainedhashtable::hash( Type const obj ) return hashM( obj.hash(), capacity() ); Recall: – obj.hash() returns a 32bit nonnegative integer – unsigned int hashM( obj, M ) returns a value in 0, …, M – 1Chained hash tables 7 Implementation Other functions mapped to corresponding linked list functions: template class Type void Chainedhashtable::insert( Type const obj ) unsigned int bin = hash( obj ); if ( tablebin.count( obj ) == 0 ) tablebin.pushfront( obj ); ++tablesize; Chained hash tables 8 Implementation template class Type int Chainedhashtable::count( Type const obj ) const return tablehash( obj ).count( obj ); Chained hash tables 9 Implementation template class Type int Chainedhashtable::erase( Type const obj ) unsigned int bin = hash( obj ); if ( tablebin.erase( obj ) ) tablesize; return 1; else return 0; Chained hash tables 10 Example As an example, let’s store hostnames and allow a fast lookup of the corresponding IP address – We will choose the bin based on the host name – Associated with the name will be the IP address – E.g., ("optimal", hash tables 11 Example We will store strings and the hash value of a string will be the last 3 bits of the first character in the host name – The hash of "optimal" is based on "o"Chained hash tables 12 Example The following is a list of the binary representation of each letter: – "a" is 1 and it cycles from there… a 01100001 n 01101110 b 01100010 o 01101111 c 01100011 p 01110000 d 01100100 q 01110001 e 01100101 r 01110010 f 01100110 s 01110011 g 01100111 t 01110100 h 01101000 u 01110101 i 01101001 v 01110110 j 01101010 w 01110111 k 01101011 x 01111000 l 01101100 y 01111001 m 01101101 z 01111010Chained hash tables 13 Example Our hash function is unsigned int hash( string const str ) // the empty string "" is hashed to 0 if str.length() == 0 ) return 0; return str0 7; Chained hash tables 14 Example Starting win an array of 8 empty linked listsChained hash tables 15 Example The pair ("optimal", is entered into bin 01101111 = 7Chained hash tables 16 Example Similarly, as "c" hashes to 3 – The pair ("cheetah", is entered into bin 3Chained hash tables 17 Example The "w" in Wellington also hashes to 7 – ("wellington", is entered into bin 7Chained hash tables 18 Example Why did I use pushfront from the linked list – Do I have a choice – A good heuristic is ―unless you know otherwise, data which has been accessed recently will be accessed again in the near future‖ – It is easiest to access data at the front of a linked list Heuristics include rules of thumb, educated guesses, and intuitionChained hash tables 19 Example Similarly we can insert the host names "augustin" and "lowpower"Chained hash tables 20 Example If we now wanted the IP address for "optimal", we would simply hash "optimal" to 7, walk through the linked list, and access when we access the node containing the relevant stringChained hash tables 21 Example Similarly, "ashok" and "vlach" are entered into bin 7Chained hash tables 22 Example Inserting "ims", "jab", and "cad" doesn’t even out the binsChained hash tables 23 Example Indeed, after 21 insertions, the linked lists are becoming rather long – We were looking for Q(1) access time, but accessing something in a linked list with k objects is O(k)Chained hash tables 24 Load Factor To describe the length of the linked lists, we define the load factor of the hash table: n l  M This is the average number of objects per bin – This assumes an even distribution Right now, the load factor is l = 21/8 = 2.625 – The average bin has 2.625 objectsChained hash tables 25 Load Factor If the load factor becomes too large, access times will start to increase: O(l) The most obvious solution is to double the size of the hash table – Unfortunately, the hash function must now change – In our example, the doubling the hash table size requires us to take, for example, the last four bitsChained hash tables 26 Doubling Size The load factor is now l = 1.3125 – Unfortunately, the distribution hasn’t improved muchChained hash tables 27 Doubling Size There is significant clustering in bins 2 and 3 due to the choice of host namesChained hash tables 28 Choosing a Good Hash Function We choose a very poor hash function: – We looked at the first letter of the host name Unfortunately, all these are also actual host names: ultra7 ultra8 ultra9 ultra10 ultra11 ultra12 ultra13 ultra14 ultra15 ultra16 ultra17 blade1 blade2 blade3 blade4 blade5 This will cause clustering in bins 2 and 5 – Any hash function based on anything other than every letter will cause clusteringChained hash tables 29 Choosing a Good Hash Function Let’s go back to the hash function defined previously: unsigned int hash( string const str ) unsigned int hashvalue = 0; for ( int k = 0; k str.length(); ++k ) hashvalue = 12347hashvalue + strk; return hashvalue; Chained hash tables 30 Choosing a Good Hash Function This hash function yields a much nicer distribution:Chained hash tables 31 Choosing a Good Hash Function When we insert the names of all 55 names, we would have a load factor l = 3.4375 – Clearly there are not exactly 3.4375 objects per bin – How can we tell if this is a good hash function – Can we expect exactly 3.4375 objects per bin • Clearly no… – The answer is with statistics…Chained hash tables 32 Choosing a Good Hash Function We would expect the number of bins which hold k objects to approximately follow a Poisson distribution kChained hash tables 33 Problems with Linked Lists One significant issue with chained hash tables using linked lists – It requires extra memory – It uses dynamic memory allocation Total memory: 16 bytes • A pointer to an array, initial and current number of bins, and the size + 12M bytes (8M if we remove count from Singlelist) + 8n bytes if each object is 4 bytesChained hash tables 34 Problems with linked lists For faster access, we could replace each linked list with an AVL tree (assuming we can order the objects) – The access time drops to O(ln(l)) – The memory requirements are increased by Q(n), as each node will require two pointers We could look at other techniques: – Scatter tables: use other bins and link the bins – Use an alternate memory allocation modelChained hash tables 35 Black Board Example Use the hash function unsigned int hash( unsigned int n ) return n 10; to enter the following 15 numbers into a hash table with 10 bins: 534, 415, 465, 459, 869, 442, 840, 180, 450, 265, 23, 946, 657, 3, 29Chained hash tables 36 Summary The easiest way to deal with collisions is to associate each bin with a container We looked at bins of linked lists – The example used host names and IP addresses – We defined the load factor l = n/M – Discussed doubling the number of bins – Our goals are to choose a good hash function and to keep the load factor low – We discussed alternatives Next we will see a different technique using only one array of bins: open addressingChained hash tables 37 References Wikipedia, 1 Cormen, Leiserson, and Rivest, Introduction to Algorithms, McGraw Hill, 1990. rd 2 Weiss, Data Structures and Algorithm Analysis in C++, 3 Ed., Addison Wesley. These slides are provided for the ECE 250 Algorithms and Data Structures course. The material in it reflects Douglas W. Harder’s best judgment in light of the information available to him at the time of preparation. Any reliance on these course slides by any party for any other purpose are the responsibility of such parties. Douglas W. Harder accepts no responsibility for damages, if any, suffered by any party as a result of decisions made or actions based on these course slides for any other purpose than that for which it was intended.
Document Information
User Name:
User Type:
United Kingdom
Uploaded Date: