Talking Distinctions

22 בדצמבר 2010

אין תגובות

The Enumerable<T>.Distinct method is a very useful helper; it was definitely requited to complete the LINQ to SQL offering (to support TSQL’s DISTINCT), and the corresponding support in LINQ to objects is very nice indeed. However, since DISTINCT runs on primitives in the DB, and objects may be a bit more complex than a simple byte-to-byte comparison – the plot might thicken.

This test passes easily.

[Test]

public void DistinctOnPrimitives()

{

    var arr = new[] {1, 1, 2};

    Assert.AreEqual(2, arr.Distinct().Count());
}

But this one fails:

public class A { public int B { get; set; } }

[Test]

public void NaiveDistinctOnObjects()

{

    var arr = new[] { new A { B = 1 }, new A { B = 1 }, new A { B = 2 } };

    Assert.AreEqual(2, arr.Distinct().Count());

}

We could override the Equals and GetHashCode to make the test pass.

Btw, have ReSharper do it for you. Click “Generate” on the type and select “Equality Members”.

image

image

And BAM:

public class A

{

    public int B { get; set; }

 

    public bool Equals(A other)

    {

        if (ReferenceEquals(null, other)) return false;

        if (ReferenceEquals(this, other)) return true;

        return other.B == B;

    }

 

    public override bool Equals(object obj)

    {

        if (ReferenceEquals(null, obj)) return false;

        if (ReferenceEquals(this, obj)) return true;

        if (obj.GetType() != typeof (A)) return false;

        return Equals((A) obj);

    }

 

    public override int GetHashCode()

    {

        return B;

    }

}

And that’s one way to make the previous test pass. But what if we don’t want to implement equality? What if the object should normally be compared by ref, and not by value? What if the equality grammar for this object is not what’s required for this specific Distinction? What if we cannot change the class?

A lovely override for Distinct accepts an IEqualityComparer<T> parameter. That’s a whole new type. So if we need to send an instance of a comparer, we need to both implement the comparison and instantiate it.

(Note that I’ve omitted a lot of required checks (nulls) for brevity)

public class AEqualityComparer : IEqualityComparer<Tests.A>

{

    public bool Equals(Tests.A x, Tests.A y)

    {

        return x.B == y.B;

    }

 

    public int GetHashCode(Tests.A obj)

    {

        return obj.B;

    }

}

And now we can have this passing test:

[Test]

public void EqualityComparerDistinctOnObjects()

{

    var arr = new[] { new A { B = 1 }, new A { B = 1 }, new A { B = 2 } };

    Assert.AreEqual(2, arr.Distinct(new AEqualityComparer()).Count());

}

But I have a distinct feeling (pun intended. Send your complaints here) that real production logic would call this Distinct(new AEqualityComparer()) a lot, and the instantiation is totally useless there.

<SingletonsAhead>

Fortunately, there’s a pattern to help us with that issue – we can have a single instance of the comparer to be used every call. Since our comparer object is stateless, and is basically just an encapsulation of unchanging comparison logic – I have no problems actually using this pattern here.

I’m using a cool little Singleton<T> helper class which looks like this:

public abstract class Singleton<T>

{

    public static T Instance { get { return Nested.InnerInstance; } }

 

    private static class Nested

    {

        public static T InnerInstance { get; private set; }

 

        static Nested()

        {

            InnerInstance = (T)Activator.CreateInstance(typeof (T), true);

        }

    }

}

(It’s lockless and lazy, so it’s a good enough implementation).

Now our comparer class looks like this:

public class AEqualityComparer : Singleton<AEqualityComparer>, IEqualityComparer<Tests.A>

{

    protected AEqualityComparer() {}

 

    public bool Equals(Tests.A x, Tests.A y)

    {

        return x.B == y.B;

    }

 

    public int GetHashCode(Tests.A obj)

    {

        return obj.B;

    }

}

And the passing tests looks like this:

[Test]

public void SingletonEqualityComparerDistinctOnObjects()

{

    var arr = new[] { new A { B = 1 }, new A { B = 1 }, new A { B = 2 } };

    Assert.AreEqual(2, arr.Distinct(AEqualityComparer.Instance).Count());

}

Which is, in my mind, HORRIBLE.

The actual comparison is twice removed from where it’s executed, and I ha to create a new class for this pointless exercise.

</SingletonsAhead>

A much much cooler method to do this would be like so:

[Test]

public void DistinctByPropertyOnObjects()

{

    var arr = new[] { new A { B = 1 }, new A { B = 1 }, new A { B = 2 } };

    Assert.AreEqual(2, arr.DistinctByX(x => x.B).Count());

}

Or, if we have more than one property to compare – like this:

[Test]

public void DistinctByPropertyOnMoreComplexObjects()

{

    var arr = new[] { new A { B = 1 }, new A { B = 1 }, new A { B = 2 } };

    Assert.AreEqual(2, arr.DistinctByX(x => x.B, x => x.C).Count());

}

So how do we go about implementing that? Long story short – here’s the code for the extension method:

public static class EnumerableOfTExtenstions

{

    public static IEnumerable<T> DistinctByX<T>(this IEnumerable<T> source, 

        params Func<T, object>[] membersToCompare)

    {

        return source.Distinct(AdHocComparer<T>.For(membersToCompare));

    }

    private class AdHocComparer<T> : IEqualityComparer<T>

    {

        private readonly Func<T, object>[] _MembersToCompare;

 

        private AdHocComparer(Func<T, object>[] membersToCompare)

        {

            _MembersToCompare = membersToCompare;

        }

 

        public static IEqualityComparer<T> For(Func<T, object>[] membersToCompare)

        {

            return new AdHocComparer<T>(membersToCompare);

        }

 

        public bool Equals(T x, T y)

        {

            return _MembersToCompare.All(c => c(x).Equals(c(y)));

        }

 

        public int GetHashCode(T obj)

        {

            return _MembersToCompare.Aggregate(0, (agg, c) => agg ^ c(obj).GetHashCode());

        }

    }

}

Now, it’s not perfect by any means: we’re still instantiating objects per comparison (need to cache those somehow, and use an efficient key for that cache), we have an additional overhead from the delegate access to the properties, we’re still not checking for nulls and all kinds of funky cases (need to write some more code), but by golly, that’s EXPRESSIVE, ain’t it?

So after having my fun with the extension method – I’ve reverted my code to use the singleton-based comparer. It felt like the grownup thing to do. Sorry for the anti-climax.

הוסף תגובה
facebook linkedin twitter email

כתיבת תגובה

האימייל לא יוצג באתר. שדות החובה מסומנים *