I'm currently studying for a test I have this week in my Data Structures course. I was reading about hash tables when I realized that .NET has this "GetHashCode" method that I had always ignored. Maybe I can study for my test and learn some C# at the same time!
I immediately opened VS and created the following class:
    public class Student
    {
        public string Name { get; set; }

        // Deliberately bad: every instance gets the same hash code.
        public override int GetHashCode()
        {
            return 1;
        }
    }
As you can see, GetHashCode will return the same value for every instance of the class. This is bad, very bad actually. Hash functions are all about spreading the data in a uniform way. This really isn't the case here... Let's see how such a thing can affect performance.
I used HashSet<T>, the collection newly added in .NET 3.5. I guessed that my bad hash function would show its effect most clearly in a hash-based collection. HashSet should be very efficient: Add, Remove and Contains should run in O(1) time on average... this relies, though, on the uniformity assumption, which does not hold in this example.
I ran the following code:
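    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    class Program
    {
        static void Main(string[] args)
        {
            HashSet<Student> t = new HashSet<Student>();

            // Time how long it takes to insert 10,000 students.
            Stopwatch s = new Stopwatch();
            s.Start();
            for (int i = 0; i < 10000; i++)
            {
                t.Add(new Student());
            }
            s.Stop();

            Console.WriteLine("It took {0} ms", s.ElapsedMilliseconds);
        }
    }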
And what I got was this horrifying result:
Almost 4 seconds to insert only 10,000 records! That's A LOT!
I decided to fix it so I changed the Student class:
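    public class Student
    {
        public string Name { get; set; }
        public int ID { get; set; }

        // Shared counter used to hand out a different ID to each new instance.
        public static int id = 0;

        public Student()
        {
            ID = ++id;
        }

        // Each instance now returns its own ID, so hash codes are all different.
        public override int GetHashCode()
        {
            return ID;
        }
    }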
I ran the testing code again and got the following result:
Wow! The improved GetHashCode method resulted in a 325% improvement in execution time!
Why did it happen?
In short, a hash collection uses an array to index its members. Each array item contains a pointer to the actual item. The index of the array item is calculated from the hash function (in C#, the GetHashCode method), typically by reducing the hash code to fit the size of the array. When the hash function gives each item a different index, there's no problem and the array index can be used to point to that one instance. The problem starts when two different instances end up with the same hash code. This is when the fun begins!
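To make the array-indexing idea a bit more concrete, here is a minimal sketch of how a hash code could be turned into an array slot. The `BucketDemo` class, the `BucketIndex` helper and the array size of 7 are all made up for illustration; this is not taken from the HashSet source, whose internals may differ in detail.

```csharp
using System;

public static class BucketDemo
{
    // Hypothetical helper: map a hash code to a slot in an array of bucketCount slots.
    public static int BucketIndex(int hashCode, int bucketCount)
    {
        // Clear the sign bit so a negative hash code still gives a valid index.
        return (hashCode & 0x7FFFFFFF) % bucketCount;
    }

    public static void Main()
    {
        const int bucketCount = 7; // arbitrary size, chosen only for the demo

        // Distinct hash codes (like the ID-based version) spread over the slots.
        for (int id = 1; id <= 5; id++)
            Console.WriteLine("hash {0} -> slot {1}", id, BucketIndex(id, bucketCount));

        // A constant hash code (like "return 1") always lands in the same slot.
        for (int i = 0; i < 5; i++)
            Console.WriteLine("hash 1 -> slot {0}", BucketIndex(1, bucketCount));
    }
}
```

With distinct hash codes the items land in different slots, while the constant hash code keeps hitting the very same slot, which is exactly the situation described next.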
When something like this happens, the process of finding a place for the colliding item is called "collision resolution", and it has several possible implementations (I don't know which one HashSet uses). The faster a collision is resolved, the faster the code will run.
My GetHashCode implementation in the first example wasn't uniform at all: every instance collided with every other one, so collisions couldn't be resolved cheaply. Each Add had to get past all the items already in the set, which turns an O(1) insert into an O(n) one. What I actually did was push HashSet into its worst case...
This was a really short explanation of hashing... If I write something like that in my test tomorrow I'll probably fail... If you want to read more about hashing and collisions, Wikipedia is a good place to start.
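To get a feeling for the cost of those collisions, here is a toy sketch that counts how many already-stored items each insert has to look at. It assumes separate chaining (one list per array slot) as the collision resolution strategy, which is only one of the possible implementations and not necessarily what HashSet does. The `CollisionDemo` class, the `CountComparisons` method and the fixed size of 1024 slots are invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class CollisionDemo
{
    // Hypothetical toy set using separate chaining: one list per bucket.
    // It inserts the numbers 0..itemCount-1 and returns how many existing
    // items had to be compared along the way.
    public static long CountComparisons(int itemCount, Func<int, int> hash)
    {
        const int bucketCount = 1024; // arbitrary and fixed for simplicity (no resizing)
        var buckets = new List<int>[bucketCount];
        long comparisons = 0;

        for (int item = 0; item < itemCount; item++)
        {
            int index = (hash(item) & 0x7FFFFFFF) % bucketCount;
            if (buckets[index] == null)
                buckets[index] = new List<int>();

            // Before adding, a set must check every item already in the bucket
            // to make sure the new one is not a duplicate.
            bool found = false;
            foreach (int existing in buckets[index])
            {
                comparisons++;
                if (existing == item) { found = true; break; }
            }
            if (!found)
                buckets[index].Add(item);
        }
        return comparisons;
    }

    public static void Main()
    {
        // Same idea as "return 1": everything lands in one bucket.
        Console.WriteLine("constant hash: {0} comparisons", CountComparisons(10000, x => 1));
        // Same idea as "return ID": items spread over the buckets.
        Console.WriteLine("distinct hash: {0} comparisons", CountComparisons(10000, x => x));
    }
}
```

With the constant hash, the i-th insert has to compare against all i items before it, so the total grows roughly like n²/2. With distinct hash codes each bucket holds only a handful of items and the total stays small.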
In conclusion
The GetHashCode method can affect performance dramatically. It's better to leave it alone and let the .NET Framework calculate it for you. If you really want to write your own, choose a really good implementation that will keep your indexes uniform enough.
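If you do decide to write your own, one common pattern is to combine the fields that define equality using a couple of prime multipliers. The version below is only a sketch under my own assumptions: that two students count as equal when both ID and Name match, that 17 and 31 are good enough multipliers, and that the sample name "Dana" stands in for real data. Note that it also overrides Equals, because equal objects must return equal hash codes.

```csharp
using System;
using System.Collections.Generic;

public class Student
{
    public string Name { get; set; }
    public int ID { get; set; }

    // Assumption for this sketch: students are equal when ID and Name match.
    public override bool Equals(object obj)
    {
        Student other = obj as Student;
        return other != null && ID == other.ID && Name == other.Name;
    }

    public override int GetHashCode()
    {
        unchecked // let the arithmetic wrap around on overflow instead of throwing
        {
            int hash = 17;
            hash = hash * 31 + ID;
            hash = hash * 31 + (Name == null ? 0 : Name.GetHashCode());
            return hash;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var set = new HashSet<Student>();
        set.Add(new Student { ID = 1, Name = "Dana" });

        // A second object with the same field values hashes to the same code,
        // so the set can find it.
        Console.WriteLine(set.Contains(new Student { ID = 1, Name = "Dana" }));
    }
}
```

One thing to keep in mind with this approach: if ID or Name changes while the object sits inside a HashSet, its hash code changes too and the set may no longer find it.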
All the best,
Shay