Guidelines and rules for GetHashCode.

I often bumped into questions like: What is the GetHashCode metho? Why use it? How to implement it correctly? So let’s try to ask on them.

What is GetHashCode used for?
It is used for putting object in a hash table.

Why do we have this method implemented on Object?
In order to be able to compare objects to other for equality.

How do hash tables and similar data structures use GetHashCode?
Let’s look at piece of code first.

class Set<T>
{
  private List<T> list = new List<T>();
  public void Insert(T item)
  {
    if (!Contains(t))
      list.Add(item);
  }
  public bool Contains(T item)
  {
    foreach(T member in list)
      if (member.Equals(item))
        return true;
    return false;
  }
}

The test for containment here is linear; if there are ten thousand items in the list then we have to look at all ten thousand of them to determine that the object is not in the list. This does not scale well.

The trick is to trade a small amount of increased memory burden for a huge amount of increased speed. The idea is to make many shorter lists, called “buckets”, and then be clever about quickly working out which bucket we’re looking at:

class Set<T>
{
  private List<T>[] buckets = new List<T>[100];
  public void Insert(T item)
  {
    int bucket = GetBucket(item.GetHashCode());
    if (Contains(item, bucket))
      return;
    if (buckets[bucket] == null)
    buckets[bucket] = new List<T>();
    buckets[bucket].Add(item);
  } 
  public bool Contains(T item)
  {
    return Contains(item, GetBucket(item.GetHashCode());
  }
  private int GetBucket(int hashcode)
  {
    unchecked
    {
      // A hash code can be negative, and thus its remainder can be negative also.
      // Do the math in unsigned ints to be sure we stay positive.
      return (int)((uint)hashcode % (uint)buckets.Length);
    }
  }
  private bool Contains(T item, int bucket)
  {
    if (buckets[bucket] != null)
    foreach(T member in buckets[bucket])
      if (member.Equals(item))
        return true;
    return false;
  }
}

Now if we have ten thousand items in the set then we are looking through one of a hundred buckets, each with on average a hundred items; the Contains operation just got a hundred times cheaper. On average.

There are plenty of improvements we could make to this hash table. But this quick sketch of a naive implementation of a hash table will do for now.

Rule: equal items have equal hashes
If two objects are equal then they must have the same hash code; or, equivalently, if two objects have different hash codes then they must be unequal.

Note that it is not a rule that if two objects have the same hash code, then they must be equal. There are only four billion or so possible hash codes, but obviously there are more than four billion possible objects. There are far more than four billion ten-character strings alone. Therefore there must be at least two unequal objects that share the same hash code.

Guideline: the integer returned by GetHashCode should never change
Ideally, the hash code of a mutable object should be computed from only fields which cannot mutate, and therefore the hash value of an object is the same for its entire lifetime.

However, this is only an ideal-situation guideline; the actual rule is:
Rule: the integer returned by GetHashCode must never change while the object is contained in a data structure that depends on the hash code remaining stable

If an object’s hash code can mutate while it is in the hash table then clearly the Contains method stops working. You put the object in bucket #5, you mutate it, and when you ask the set whether it contains the mutated object, it looks in bucket #74 and doesn’t find it.

Rule: GetHashCode must never throw an exception, and must return
Getting a hash code simply calculates an integer; there’s no reason why it should ever fail. An implementation of GetHashCode should be able to handle any legal configuration of the object.

Guideline: the implementation of GetHashCode must be extremely fast
The whole point of GetHashCode is to optimize a lookup operation; if the call to GetHashCode is slower than looking through those ten thousand items one at a time, then you haven’t made a performance gain.

Guideline: the distribution of hash codes must be “random”
By a “random distribution” I mean that if there are commonalities in the objects being hashed, there should not be similar commonalities in the hash codes produced.

Used resources:
http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/
http://stackoverflow.com/questions/462451/gethashcode-guidelines-in-c-sharp

Share this post:Tweet about this on TwitterShare on Facebook0Share on LinkedIn0Share on Google+0Share on Reddit0Email this to someoneDigg this