CIS 22 - Data Structures: Hash Table - Handling Collisions



Handling Collisions

     The two ways of dealing with collisions are: rehashing and chaining.

Rehashing - resolving a collision by computing a new hash location (index) in the array.

Linear Probing is one type of rehash (getting a new index) - a very simple one. In case of linear probing, we are looking for an empty spot incrementing the offset by 1 every time. We explore a sequence of location until an empty one is found as follows:    

index, index + 1, index + 2, index + 3, ...


In the case of linear probing the code for rehash method will look like this:    
    int Hashtable::rehash(int index) 
    {
        int new_index;

        new_index = (index + 1) % TABLE_SIZE;
       
        return new_index; 
    }


An alternative to linear probing is quadratic probing.

Quadratic Probing is a different way of rehashing. In the case of quadratic probing we are still looking for an empty location. However, instead of incrementing offset by 1 every time, as in linear probing, we will increment the offset by 1, 3, 5, 7, ... We explore a sequence of location until an empty one is found as follows:    

index, index + 1, index + 4, index + 9, index + 16, ...

Linear Probing is resolving a hash collision by sequentially searching a hash table beginning at the location returned by the hash function.

In this case, hash table is implemented using an array. The program stores the first element that generates a specific array index at that index. For example, if the hash function generates 79, then you use array index 79 to store the element. When the hash function generates the key 79 again, the program begins a sequential search starting at location 79, looking for the next available spot. The second element whose key was transformed by hash function into 79 will be stored at the location 80, the third at 81 and so on. Of course, if 80 and 81 are already occupied, the elements will be stored farther away from the location generated by hash function.


Retrieving a value.

When it comes to retrieving a value, the program recomputes the array index and checks the key of the element stored at that location. If the desired key value matches the key value at that location, then the element is found. The search time is on the order of 1, O(1) (1 comparison per data item). If the key doesn’t match, then the search function begins a sequential search of the array that continues until:
  • the desired element is found
  • the search encounters an unused position in the array, indicating that the element is not present
  • the search encounters the index which was produced by the hash function,indicating that the table is full and the element is not present
In worst case, the search takes (n-1) comparison, which is on the order of n, O(n). “Worst case” is if the table is full and the search function goes through the whole array, (n-1) elements, every time comparing the desired key value with the key at the array location. In the worst case element is either found at the last position before the one that was produced by the hash function or not found at all.

Advantages to this approach
  • All the elements (or pointer to the elements) are placed in contiguous storage. This will speed up the sequential searches when collisions do occur.

Disadvantages to this approach
  • As the number of collisions increases, the distance from the array index computed by the hash function and the actual location of the element increases, increasing search time.
  • Element tend to cluster around elements that produce collisions*. As the array fills, there will be gaps of unused locations.
  • The hash table has a fixed size. At some point all the elements in the array will be filled. The only alternative at that point is to expand the table, which also means modify the hash function to accommodate the increased address space.
*Note: There are slight variations to the preceding scheme that attempt to space out the elements, avoiding the clustering of elements that occurs with adjacent placement. Collided elements can be placed some fixed number of locations away from the location generated by the hash function. For example, suppose we decided to place the elements four locations apart. If the first element is stored at the location 76, the second element whose hashed key generated is 76, would be stored at 80, the third at 84 and so on. If a location is occupied, the program adds four and checks again, until it finds an unused location.
The major alternative to using adjacent storage is to use linked list. In this case table is implemented as an array of linked lists. Every element of the table is a pointer to a list. The list will contain all the elements with the same index produced by the hash function. For example, when the hash function produces key 79, the new node is created and the array element with index 79 now points to this node. When another element’s key produces index 79, the new node will be created and attached to the first one and so on. The same set of data as in the previous example will be organized as follows when we use linked list for handling collisions:


Retrieving a value.
     When it comes to retrieve a value, the hash function is applied to the desired key and the array index is produced. Then the search function accesses the list to which this array element points to. It compares the desired key to the key in the first node. If it matches, then the element is found. In this case the search of the element is on the order of 1, O(1). If not, then the function goes through the list comparing the keys. Search continues until the element is found or the end of the list is detected. In this case the search time depends on the length of the list.

Advantages to this approach
  • The hash table size is unlimited (or limited only by available storage space). In this case you don’t need to expand the table and recreate a hash function.
  • Collision handling is simple: just insert colliding records into a list.

Disadvantages to this approach
  • As the list of collided elements (collision chains) become long, search them for a desired element begin to take longer and longer.
  • Using pointers slows the algorithm: time required is to allocate new nodes.

Back to
the main page
Back to CIS 22 home page
Luba Leyzerenok, Fall 2003