Improving Upon LINQ’s Distinct

August 1, 2009

tags: ,
2 comments

Suppose you have a collection of objects, and you want only the distinct values. You could easily use Linq for this, like that:

var items = new[] {"BMW","Fiat","Ferrari","Fiat"};

var distinctItems = items.Distinct();

This uses the default comparer for the objects in order to see if they’re equal. There’s also another overload, which lets you specify your own comparer. This is its signature:

public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer);

It accepts an IEqualityComparer to compare between items. But the usage of IEqualityComparer is a bit annoying, as you have to create you own class that implements that interface. It would be nice if I could just do this:

var items = new[] {"HI", "hi","hello","HELLO"};

var distinctItems = items.Distinct((str1, str2) => str1.ToLower().Equals(str2.ToLower()));

So in this example we would expect the result array to contain two items and not four.

Let’s see if we can make this happen. We’ll create our own IEqualityComparer which accepts a lambda.

   1: public class LambdaEqualityComparer<T> : IEqualityComparer<T>

   2: {

   3:     public Func<T,T,bool> _comparison;

   4:  

   5:     public LambdaEqualityComparer(Func<T,T,bool> comparison)

   6:     {

   7:         _comparison = comparison;

   8:     }

   9:     

  10:     public bool Equals(T x, T y)

  11:     {

  12:         return _comparison(x, y);

  13:     }

  14:  

  15:     public int GetHashCode(T obj)

  16:     {

  17:         return obj.GetHashCode();

  18:     }

  19: }

Here we have a class that accepts the comparison lambda as a parameter. Now we can add our own Distinct extension method:

public static class EnumerableExtensions

{

    public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> source, Func<TSource,TSource,bool> comparison)

    {

        return source.Distinct(new LambdaEqualityComparer<TSource>(comparison));

    }

}

Once again, let’s try to run this:

var items = new[] {"HI", "hi","hello","HELLO"};

var distinctItems = items.Distinct((str1, str2) => str1.ToLower().Equals(str2.ToLower()));

Well, that didn’t work out well at all. It returns a four items array. Digging into it a bit, we can see that the problem is in our LambdaEqualityComparer class, specifically with our naive GetHashCode implementation (lines 15-17 above):

public int GetHashCode(T obj)

{

    return obj.GetHashCode();

}

As you can see our comparison doesn’t come into play at all for GetHashCode. It also appears that the original Distinct uses (at least in some cases) GetHashCode and not Equals to determine if two objects are equal, and so our lambda comparison doesn’t even run.

We need to attack this differently, then. Let’s try to make this code work:

var items = new[] {"HI", "hi","hello","HELLO"};

var distinctItems2 = items.DistinctBy(str1 => str1.ToLower());

So in this new DistinctBy method, we’re not comparing two items, but selecting something to compare by for every item. The implementation is similar to before, and here is the entire thing:

   1: public static class EnumerableExtensions

   2: {

   3:     public static IEnumerable<TSource> DistinctBy<TSource, TSelection>(this IEnumerable<TSource> source, Func<TSource, TSelection> selector)

   4:     {

   5:         return source.Distinct(new SelectorEqualityComparer<TSource, TSelection>(selector));

   6:     }

   7:  

   8:     private class SelectorEqualityComparer<TSource, TSelection> : IEqualityComparer<TSource>

   9:     {

  10:         public Func<TSource, TSelection> _selector;

  11:  

  12:         public SelectorEqualityComparer(Func<TSource, TSelection> selector)

  13:         {

  14:             _selector = selector;

  15:         }

  16:  

  17:         public bool Equals(TSource x, TSource y)

  18:         {

  19:             return Object.Equals(_selector(x), _selector(y));

  20:         }

  21:  

  22:         public int GetHashCode(TSource obj)

  23:         {

  24:             return _selector(obj).GetHashCode();

  25:         }

  26:     }

  27: }

As you can see, both the implementation of GetHashCode and Equals now use our selector lamda, and so the code works, and all is well with the world.

Add comment
facebook linkedin twitter email

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

*

2 comments

  1. KellyOctober 16, 2009 ב 22:18

    Thanks, I was looking for something just like this! Can’t believe Microsoft hasn’t made distinct work with a lambda like this.

    Reply
  2. Mike FaulkinburyNovember 18, 2009 ב 22:44

    Thought I’d share our solution to the problem with your first example: just add another function to your constructor, to be used in the GetHashCode implementation.

    public LambdaEqualityComparer(Func comparerFunction, Func hashCodeFunction){…}

    I like the idea of having a single “selector” function, too, but it gets tough when you need to do a multiple member comparison (Person.Age && Person.LastName).

    Thanks again for the inspiration!

    Reply