CIS 22 - Data Structures: Hash Tables and Hash Functions


Hash Table, Hash Function, Collisions.
    
     Hash Table is a data structure in which keys are mapped to array positions by a hash function. This table can be searched for an item in O(1) time using a hash function to form an address from the key. The easiest way to conceptualize a hash table is to think of it as an array. When a program stores an element in the array, the elements key is transformed by a hash function that produces array indexes for that array.

     Hash Function is a function which, when applied to the key, produces an integer which can be used as an address in a hash table. The intent is that elements will be relatively randomly and uniformly distributed. In the example above the code for the hash function would look like this (assuming TABLE_SIZE was defined 100):      

    int Hashtable::hash_function(int id_num) 
    {
        return (id_num % TABLE_SIZE); 
    }

Then, for example, to find an element in the table, a program applies hash function to the element’s key, producing the array index at which the element is stored.      
    Type* Hashtable::find (char* key) 
    { 
        int index; 

        index = hash_function(key); 

        //finding the element 
    
    } 

Unfortunately, this is not quite simple. For example, we have already stored several employees’ records, and our table looks something like this:


    Suppose, our next employee’s key is 57879. Then our hash function will produce array index 79. But the array element with index 79 already has a value. As the array begins to fill, keys will inevitably produce the same array index when transformed by the hash function. When more than one element tries to occupy the same array position, we have a collision.

Collision is a condition resulting when two or more keys produce the same hash location.

Now, another question arises: can’t we generate the hash function that never produces a collision? The answer is that in some cases it is possible to generate such a function. This is known as a perfect hash function.

Perfect Hash Function is a function which, when applied to all the members of the set of items to be stored in a hash table, produces a unique set of integers within some suitable range. Such function produces no collisions.

Good Hash Function minimizes collisions by spreading the elements uniformly throughout the array.

     There is no magic formula for the creation of the hash function. It can be any mathematical transformation that produces a relatively random and unique distribution of values within the address space of the storage. Although the development of a hash function is trial and error, here are some hints that may make process easier:
  • Set the size of the storage space to a prime number. This will help generate a more uniform distribution of addresses.
  • Use modulo arithmetic (%). Transform a key in such a way that you can perform X % TABLE_SIZE to generate the addresses
  • To transform a numeric key, try something like adding the digits together or picking every other digit.
  • To transform a string key, try to add up the ASCII codes of the characters in the string and then perform modulo division.
    However, finding a perfect hash function that works for a given data set can be extremely time consuming and very often it is just impossible. Therefore we must live with collisions and learn how to handle them.
Back to
the main page
Back to CIS 22 home page
Luba Leyzerenok, Fall 2003