DCSIMG
Google+ Doron's .NET Space

A while ago we wanted to choose a new natural language parser for our product. One of the strongest candidates in that area is the Charniak-Johnson parser. We ended up not choosing to use this parser for various reasons, but in the process of evaluating it we produced a nice side benefit: we compiled it (well, the Charniak part of it) and ran it natively on Windows.

The source code for the converted project can be found on GitHub, and the binaries can be found here.

Now, before I tell you a bit about the conversion process, here are a few important notes:

  • This is not the full CJ parser. This is the first-stage Charniak parser only (Aug06 version). The re-ranking second stage was not converted.
  • We converted only the non-multithreaded version of the main loop (oparseit.c and not parseit.c). Still, the parsing code should support calls from multiple threads should you choose to do so (up to the MAXNUMTHREADS parameter which defaults to 4).
  • This is a completely native version of the parser, it doesn’t need cygwin/mingw32 or anything of the sort.
  • It does need, however, the Visual C++ 2012 Redistributables to run.
  • You need Visual Studio 2012 to open the project (but you can probably easily port it to older VS versions).
  • We are not actually using this in our product, anywhere. We thought about using it, but we’re not. This is provided ‘as is’ as a service to the community, and the original license of the parser remains in place.

So now that the formalities are out of the way, here are some details about the conversion process. First, I should note that I am no C++ expert. C# is my thing, and while I did some C/C++ projects in the university, it was quite a while ago. Therefore, it was not an easy task and it took me a few days to get the entire thing to work. Here are some things I had to do (probably not all, this was a long time ago):

  • Convert all the .c files to .cpp files so Visual Studio will be happy with the code using C++ constructs.
  • Fix header issues of all sorts (like using #stdafx.h everywhere).
  • Add unicode support by changing all the constant strings to have the L macro applied on them, use wchar_t everywhere in the code, use wistream and friends instead of istream and friends, etc.
  • Figuring out code and files that are not needed for pure parsing and get rid of them. The less files I have to touch, the better. So for instance, I got rid of all the EvalTree stuff in the original parser (a different command which evaluates parse trees. Didn’t need that).

The biggest headache, though, was the performance issue. For some reason, the parser performed much worse on Windows than on Linux, and it was pretty hard to understand why. I ended up using the Visual Studio C++ profiler, and traced the issue to a line that allocates a large array of objects. Digging a little further, I noticed that each allocated object contains a large std::list, so in fact, I was allocating lots of lists. The code basically boils down to this:

list<char*>* item = new list<char*>[160000]; 

I measured the code above to be 10 times slower on Windows than on Linux. Funky, eh? At first I thought that memory allocation on Windows is slower than on Linux. Who knows, right? Well, the folks at StackOverflow helped me realize that memory allocation has nothing to do with this, it is actually the std::list constructor that is slower in the Microsoft implementation, probably because it allocates a bigger initial buffer. Once I knew this, I was able to tweak to code to dynamically allocate the std::lists, so I was able to create a lot less of them. This solved the performance issue, and in fact I measured the parser to be even faster on Windows that on Linux now.

That’s it for now, I guess. If you have any questions about using the binaries or the code posted above, please contact me.

Posted by dorony | 3 comment(s)
תגים:,

At Ginger we use a large index of n-grams, which is basically a sequence of words and their frequency in our corpus. We wanted to make this index searchable, so naturally, we defaulted to using Lucene, which is the most popular open source IR library. This is how we started adding documents to the index:

   1: Document document = new Document();
   2: document.add(new Field("ngram", ngram, Field.Store.YES, Field.Index.ANALYZED));
   3: NumericField frequencyField = new NumericField("frequency", Field.Store.YES, true);
   4: frequencyField.setLongValue(frequency);
   5: document.add(frequencyField);

As you can see, we are indexing and storing both the ngram and the frequency fields, where one is a string and the other is a long.

After the indexing was done, we wanted to search the index. Our main use case for searching is this: take a bunch of words, and return the most frequent n-grams that contain them. My natural inclination was to use sorting for this, by using IndexSearcher.search(Query query, int n, Sort sort). This allows us to supply a Sort object that would sort by the frequency field in descending order.

This turned out to be a bad idea. Lucene handles sorting by using the FieldCache. This means it tried to load all the frequencies of all the n-grams in the index into an array. This is a lot of data to load, and an OutOfMemoryException was thrown. But then I realized that I always wanted results to return ordered by frequency. I don’t need different result sets to be sorted by different fields, so this information should be stored in the index itself and shouldn’t require a cache.

Therefore, I arrived at the concept of Scoring. Scoring in Lucene determines how results are ordered by default, which is exactly what I needed. The default algorithm attempts to return the results most relevant to a given query, and I wanted to spice it up with my frequency information. The way you do that in Lucene is by using document level boosting. This allows you to say that a specific document is more relevant than others. At first I did the naive thing and set the boost to be the frequency, like this:

   1: document.setBoost(frequency);

It even seemed to work for a while, but then I noticed a couple of issues:

  1. The boost factor was too dominant: Lucene sometimes preferred to return n-grams with less query words, but have a greater frequency, over n-grams with more query words, but with less frequency. So, for instance if I search for “pleasure meet you” it would prefer “I want to meet you” (100000 frequency) over “pleasure to meet you” (10000 frequency). This caused me to revise my perspective: I don’t always want results to be returned ordered by frequency, the amount of contained query words is an important factor as well.
  2. Sometimes results weren’t exactly ordered by frequency. An n-gram with 1000 frequency seemed to have exactly the same score as an n-gram with 10000 frequency, even though it was given a higher boost factor.

While the first issue was rather obvious, the second one had me baffled for a while. It turns out even though the boost value is a float, it is stored as only a byte, as the document norm. This value combines the document boost, the field boost (individual fields can be boosted as well, not useful for my case) and a length factor, which basically says that shorter documents are better (again not relevant for me, all the n-grams in a single index have the same length). So basically, we took a long (the frequency), which has a huge range of different values, and stuffed it into a byte which only allows for 256 different values. Obviously, I’m going to lose a bit of precision here.

The solution to both issues was to compute the boost in a different way. Instead of setting it to the frequency, we will set it to some function of the frequency. This should achieve:

  1. Smaller values for the boost factor, so it doesn’t override the coordination factor, which causes documents with more query terms to get a higher score.
  2. A better ordering of n-grams by frequency. An n-gram with 2000 frequency should get a higher score than an n-gram with 1000 frequency, but I don’t care if an n-gram with 110000 frequency has the same score as 100000 frequency. In essence, I can afford losing precision at the higher ranges of frequency, but less so in the lower ranges.

The function we came up with looks something like this:

   1: public static float getBoost(long frequency) {
   2:     double boost = (Math.log(frequency) / Math.log(1.5)) * 10;
   3:     if (boost > 255)
   4:         return 255;
   5:     return Math.round((float)boost);
   6: }

Basically, we want the boost to be a number between 1 and 255 (which achieves our first goal of smaller boost values). By taking a log of the frequency (base 1.5) we are not only getting smaller values, but also achieving our second goal, of better precision for lower frequency values.  Above a certain threshold, all the n-grams will have the same boost factor of 255.

At this point I thought I was done, but I was still seeing precision loss. Lucene’s default way of encoding a float to a byte and vice versa didn’t suit me; It tries to represent a large range of values when I just need to use 1-255. Therefore, I had to create my own Similarity class which overrides this behaviour.

   1: public class NgramSimilarity extends DefaultSimilarity { 
   2:     @Override
   3:     public float computeNorm(String field, FieldInvertState state) {
   4:        return state.getBoost();    //ignore length factor
   5:    }     
   6:     @Override
   7:     public byte encodeNormValue(float f) {
   8:         return (byte)f;
   9:     }
  10:     @Override
  11:     public float decodeNormValue(byte b) {
  12:         return b & 0xFF; //makes the byte unsigned so we can use the full 0-255 range
  13:     }
  14: }
First, since the length factor is irrelevant to us, we override computeNorm in order to ensure that norm = boost only. Second, as you can see, the encoding method is quite simple, and just casts the float to byte. We know for sure that the boost is in the range of 1-255 and that norm = boost, so we can safely cast. The decoding method is a little weirder; Apparently Java’s byte is signed (unlike C#’s byte which is unsigned). A simple cast back from byte to float would have a given us a negative number, hence we apply the 0xFF mask.

We now just have to apply our Similarity class to IndexWriter, like this:

   1: IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_35, analyzer);
   2: indexWriterConfig.setSimilarity(new NgramSimilarity());
   3: IndexWriter writer = new IndexWriter(directory, indexWriterConfig);

It is also important to remember to use the custom Similarity class when searching (that’s when decodeNormValue is called):

   1: IndexSearcher searcher = new IndexSearcher(reader);
   2: searcher.setSimilarity(new NgramSimilarity());

And now we’re done.

My only niggle about Lucene is the documentation; It has excellent documentation, but only in the form of a book (which I read and enjoyed). But a PDF book is a lot less cosy to work with than something that is truly online (like blog posts). You can’t google and get results from Lucene In Action. Other than that, I would say that Lucene has been a perfect fit for our purpose. It is easily customizable, so we could tune it to our specific needs, and gives excellent query performance.  Open source is awesome, isn’t it?

Posted by dorony | with no comments

I’ve hit a weird issue today. We have a service that we run both as Windows service and from console. A specific use case seemed to cause our system to hang, but only when running from console. Also, I was sure this didn’t happen before I upgraded my machine to .NET 4.5. The service initialization code looks something like this:

serviceHost.Open();
 
while (Console.ReadKey().Key != ConsoleKey.Q)
{
    Console.WriteLine("Press Q to exit");
}

Basically, we start a WCF service, and wait for the user to press Q to exit the application. Fairly standard, and used to work until .NET 4.5.

Digging a little further, I found out that the system was stuck on some third-party code which was doing something like:

Console.Error.WriteLine("Error!");

We have other places that write to stdout (Console.WriteLine), but this seems to be the only place that writes to stderr (and only in very specific cases, which is why I didn’t notice it until today).

In order to reproduce the issue, I wrote the following program:

public static void Main()
{            
    var readTask = new Task(() =>
                                {
                                    Console.WriteLine("enter something");
                                    Console.ReadKey();
                                });
    var writeTask = new Task(() => Console.Error.WriteLine("hello"));
 
    readTask.Start();
    Thread.Sleep(100);            
    writeTask.Start();
 
    Task.WaitAll(readTask, writeTask);
}

We have two threads, one waits for an input from the user, and the other writes to Console.Error. This program will never terminate under .NET 4.5 (at least until you press a key), but will terminate under .NET 4.0. Fun! Let’s compare the .NET 4.5 and 4.0 source code in order to find out what happens here. Here is Console.ReadKey in .NET 4.5:

   1: Win32Native.InputRecord ir;
   2: int numEventsRead = -1; 
   3: bool r;
   4:  
   5: lock (InternalSyncObject) {
   6:    if (_cachedInputRecord.eventType == Win32Native.KEY_EVENT) { 
   7:        // We had a previous keystroke with repeated characters.
   8:        ir = _cachedInputRecord; 
   9:        if (_cachedInputRecord.keyEvent.repeatCount == 0) 
  10: ....

Can you see the lock in line 5? It doesn’t exist in the .NET 4.0 source. And here’s Console.Error in both .NET 4.0 and 4.5:

   1: public static TextWriter Error {
   2:     [HostProtection(UI=true)] 
   3:     get { 
   4:         Contract.Ensures(Contract.Result<TextWriter>() != null);
   5:         // Hopefully this is inlineable. 
   6:         if (_error == null)
   7:             InitializeStdOutError(false);
   8:         return _error;
   9:     } 
  10: }

As you can see, if this is the first time we’re using the stream it will try to initialize it. So let’s see the initialization code (again, for both .NET 4.0 and 4.5).

   1: private static void InitializeStdOutError(bool stdout) 
   2: { 
   3:     // Set up Console.Out or Console.Error.
   4:     lock(InternalSyncObject) { 
   5:         if (stdout && _out != null)
   6:             return;
   7:         else if (!stdout && _error != null)
   8:             return; 
   9: ....

Hmm, that lock seems familiar! This is what’s causing our code to deadlock. Console.ReadKey now locks a synchronization object which gets locked when you attempt to access stderr for the first time. Therefore, the solution is simple, we need to cause Console.Error to initialize before the call to Console.ReadKey.

   1: Console.Error.WriteLine("Initializing stderr to prevent threading issues...");
   2:  
   3: serviceHost.Open();
   4: while (Console.ReadKey().Key != ConsoleKey.Q)
   5: {
   6:     Console.WriteLine("Press Q to exit");
   7: }

Note that this may happen to you even if you’re targeting .NET 4.0, as long as you have .NET 4.5 installed. This is because .NET 4.5 is an in-place replacement for .NET 4.0, so you’re always running with the safer (and deadlock-causing) Console.ReadKey() if you installed the new framework version.

Poking around in dotPeek, I came across the implementation of String.Equals(string) in the .NET framework. This is how it looks:

   1: [ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
   2:     [SecuritySafeCritical]
   3:     private static unsafe bool EqualsHelper(string strA, string strB)
   4:     {
   5:       int length = strA.Length;
   6:       if (length != strB.Length)
   7:         return false;
   8:       fixed (char* chPtr1 = &strA.m_firstChar)
   9:         fixed (char* chPtr2 = &strB.m_firstChar)
  10:         {
  11:           char* chPtr3 = chPtr1;
  12:           char* chPtr4 = chPtr2;
  13:           while (length >= 10 && (*(int*) chPtr3 == *(int*) chPtr4 && *(int*) (chPtr3 + 2) == *(int*) (chPtr4 + 2)) 
  14:               && (*(int*) (chPtr3 + 4) == *(int*) (chPtr4 + 4) && *(int*) (chPtr3 + 6) == *(int*) (chPtr4 + 6) && *(int*) (chPtr3 + 8) == *(int*) (chPtr4 + 8)))
  15:           {
  16:             chPtr3 += 10;
  17:             chPtr4 += 10;
  18:             length -= 10;
  19:           }
  20:           while (length > 0 && *(int*) chPtr3 == *(int*) chPtr4)
  21:           {
  22:             chPtr3 += 2;
  23:             chPtr4 += 2;
  24:             length -= 2;
  25:           }
  26:           return length <= 0;
  27:         }
  28:     }

As you can tell, this is some funky stuff! As this is a performance critical method (comparing strings is a pretty common task, obviously), the implementers have gone all pointery. There are several things to note here:

  • The fixed keyword at lines 8-9: it tells the GC to leave our strings alone and not move them around while we’re using the pointers to them. Pointers won’t do much good if the managed objects are moved in the heap before the comparison is done.
  • The beautiful while loop at line 13 compares 20 bytes at a time (int = 4 bytes and we’re looking at 5 integers at a time). When we’re at the last 20 bytes, we do 4 bytes at a time. I assume this rather weird method of comparing is faster, perhaps due to less pointer increments/decrements. If one of you hackers would like to explain, that would be awesome.
  • One might wonder, since a char = 2 bytes, and we’re looking at 4 bytes at a time at a minimum, how come this method works on strings with an odd length? Well, our current guess is that since the string is null terminated in memory, we either compare the null char (for odd strings) or not (for even strings). Either way – it shouldn’t matter to the result, we can always assume an even amount of characters. This assumption would of course not work for strings in different lengths, but lines 5-7 take care of that for us. (This my colleague Yuval Pinter’s insight. Thanks!)

Like I said, funky stuff.

Posted by dorony | 6 comment(s)

I usually love using NuGet, but there are things about it that can sometimes make me scream with annoyance. My main issue is with the following scenario:

  1. I want to add a new dependency to a project.
  2. This dependency already exists in the solution.

Here, one of two things may happen:

  1. The version that exists in the solution is the newest that came out. In which case, no problemo, NuGet will see that I already have that package installed and use it.
  2. A new version came out since the we added the dependency to our solution.

In the second case, what I usually want to do is either use the same version other projects use (not the new version), or upgrade the entire solution to the new version. What I NEVER want is to keep two different versions of the same dependency in my solution. That is just confusing. And that is exactly what NuGet does when I “Install-Package” the dependency I want to a new project. The result is a big mess in our 'packages’ folder, which currently holds 4 different versions of AutoMapper.

Since I don’t want this to happen, my current workflow is like this:

  1. Decide I want to add a new dependency to the solution.
  2. Manually check in the packages folder if we’re already using a certain version of this dependency.
  3. Manually check in the NuGet package gallery if a new version came out.
  4. Decide to upgrade the entire solution, or -
  5. Explicitly install the older version with Install-Package –version

Needless to say, this is very cumbersome.

Posted by dorony | with no comments

*Update (27/8): It is critical to set the time below to AutoReset = false, otherwise, the exception might be raised again, even after we handled it.

I guess you could say that every exception is special and important in its own way, but ThreadAbortException is really special. In what way, you ask? Well, let me tell you a story.

I wanted to implement a hard timeout for our system. That is, if a query to our service takes too long, force kill it, not matter what it is doing at the moment. You may claim that this is not a best practice, but it was the only way to ensure that even if a horrible bug that is causing an infinite loop in some very rare conditions creeped into our system, it won’t cause a meltdown in the server. So code like this was placed:

   1: public T WithTimeout<T>(Func<T> action, int timeoutInMilliseconds )
   2: {
   3:     using (var timer = new Timer(timeoutInMilliseconds){AutoReset=false})
   4:     {                
   5:         var thread = Thread.CurrentThread;
   6:         timer.Elapsed += (o, s) => AbortThread(thread, timeoutInMilliseconds);
   7:         timer.Start();
   8:         try
   9:         {                    
  10:             return action();
  11:         }
  12:         catch (ThreadAbortException ex)
  13:         {            
  14:             throw new ForcedTimeoutException(string.Format("Query had to be foricbly aborted after {0} milliseconds.", timeoutInMilliseconds), ex);
  15:         }
  16:     }
  17: }

As you can see, I’m calling Thread.Abort in order to abort the work on the stuck thread, causing a ThreadAbortException to get thrown in it.

And this worked. Sort of. I wanted to write a test to make sure, so I wrote something like:

   1: [Test]
   2: public void Can_force_kill_query()
   3: {
   4:     try
   5:     {
   6:         WithTimeout(() => Thread.Sleep(1000), 10);
   7:     }
   8:     catch (ForcedTimeoutException)
   9:     {
  10:         return;
  11:     }
  12:     Assert.Fail();
  13: }

But I couldn’t get the test to pass! It seemed like ThreadAbortException keeps getting thrown, even though I’m catching it. Well, apparently, that’s what makes it special. ThreadAbortException will be thrown again and again by the CLR, even if you catch and “handle” it in a catch clause. You can overcome this behaviour, though, by calling Thread.ResetAbort. This puts things back to normal. So, the above catch clause was updated to:

   1: catch (ThreadAbortException ex)
   2: {         
   3:     Thread.ResetAbort();   
   4:     throw new ForcedTimeoutException(string.Format("Query had to be foricbly aborted after {0} milliseconds.", timeoutInMilliseconds), ex);
   5: }

Hopefully you won’t ever need this, but just in case.

Posted by dorony | 2 comment(s)

This is just a tiny little note that can save you some precious time. This article explains exactly what you need to do in order to profile NUnit tests with dotTrace (gotta love products with special placement of capital letters) 4.0, except it misses an important detail. The latest versions of NUnit run the actual tests in a separate process, called nunit-agent.exe. So, if you profile nunit-console.exe you won’t see your code there at all. You have to run NUnit in a single process mode. To do that just add /process:Single to the nunit-console command line arguments, and everything should work nicely now.

Posted by dorony | with no comments
תגים:

The other day I had to create a .CSV file with some funky Unicode characters. Not only that, but Excel had to be able to open and edit it. When using .NET’s StreamWriter default constructor, it uses a default encoding of UTF-8 without BOM (byte order mark), which Excel can’t read.

Well, actually, it can read it, but it doesn’t realize that this is a unicode file and special characters (such as this lovely one - Ž) have a tendency to look like someone just puked a letter on the screen.

The solution is quite simple. You need to use the other StreamWriter constructor, which accepts Encoding. You then have to pass it an Encoding of UTF-8 with BOM, which you can create by using the non-default constructor of the UTF8Encoding class. The whole thing looks something like this:

   1: private static readonly Encoding Utf8WithBom = new UTF8Encoding(true);
   2:  
   3: public void WriteSomething()
   4: {
   5:     using (var streamWriter = new StreamWriter(@"c:\hello.csv", true, Utf8WithBom))
   6:     {
   7:         streamWriter.WriteLine("hello,world");
   8:     }
   9: }

By the way, this is more of an issue with Excel then with .NET. As the above Wikipedia article states, the UTF-8 specification doesn’t require the BOM, and most applications can do without it.

Posted by dorony | 2 comment(s)
תגים:

I’m a C# guy, and always have been. I haven’t done any Java-ing since my days at the university, so I was glad of the opportunity to a do Java-related task at work. It was really a tiny little thing, but I got a little taste of the state Java is in and thought I would share.

Now, I’m hardly objective, and I’m basing this on a few hours of work tops. Also, any criticism I have is against Java the language, not Java the Framework.

So, with the disclaimers out of the way, let the rant begin. Basically, coding in Java is like coding in some sort of non existent C# 1.5. Here’s why:

No Linq and Lambda expressions. I found myself implementing stuff like Max and Average which I thought I would never have to again. I didn’t realize how used I got to the productivity that Linq+lambda expressions enable. They really have to write a lot of loops in Java, which is no fun.

Generics in Java suck. Basically, in Java a List<String> is a List<Object> and not a type by itself. This means, that among other issues, you can’t create a List<int>, only a List<Integer> so boxing is required, and performance suffers. This was done for backward compatibility purposes – since in Java generics is a compiler feature and not a framework supported feature, newer code can run on older versions of Java. But it results in a Generics implementation that is much worse than the C# one. This is the reason I think Java 7 is actually not as good as C# 2.0 was (not to talk about C# 4.0).

Simple operations are not simple enough. This is probably me at my most nitpicking, but I was really missing stuff like File.ReadAllLines and other file operations in Java. Of course, they can be easily implemented and I’m sure many did implement them, but it’s nice of .NET to give this out of the box. I consider this nitpicking, because there are probably lots of library methods in Java that are missing in C# and I just don’t know them. But still, reading a file, come on, make it trivial!

To end in a more positive note, as someone who hardly knows any Java I was up and running with an IDE (netbeans) in no time, and the experience was pretty similar to Visual Studio. It is definitely beginner friendly, and there is a ton of material on the web. Too bad the language is not up to par.

Posted by dorony | with no comments

I was informed today that a Visual Studio user voice issue I voted on was dismissed. The issue is “Create an x64 version of Visual Studio”, and around 1300 people voted for it. Sadly, Microsoft decided it’s not going to happen any time soon. Visual Studio’s program manager, Nathan Halstead, explains:

After reviewing telemetry on memory utilization, the scale of typical data sets loaded in Visual Studio, and hardware trends across our user base, we determined that one of the most effective ways to improve memory utilization across our entire user based would be to directly focus our efforts on overall memory reduction and improving scalability of key pieces of the product working with large data sets. While creating an x64 version of Visual Studio would unlock a virtually unlimited amount of memory to the Visual Studio process, most of our users would not see a direct performance or reliability benefit from this change alone.

In essence, building an x64 version of VS is a huge undertaking, and they just don’t think it’s worth it. Most users won’t tell the difference. Visual Studio 11 is already known to not have native x64 support, and looks like we can’t expect version 12 to be any different.

And that, well, kinda sucks. Why? Well:

I am regularly approaching the 2GB memory limit in my VS process. Visual Studio + Resharper work their magic together. As projects get bigger and more complex, it seems almost crazy to me that Microsoft doesn’t intend to support the ever growing demands of teams working with its IDE, and more than a few plugins. They may reduce the IDE memory consumption itself, but they can’t control the amount of memory plugins use – and plugins like Resharper are almost a must in every developer’s toolbox these days. And you can be sure that as the tools we use get smarter, they would need more memory. So yes, it’s a big project, but if they don’t begin it now, they’ll have some angry developers on their heads in a couple of years.

Posted by dorony | 2 comment(s)
תגים:

I guess this is something that is obvious to many, but it really had me stumped yesterday. I was looking at code that was joining two HashSets, that both had the same item. Not the same reference, but they had the same hash code, and Equals between the items returned true. And yet, Set1.UnionWith(Set2) changed Set1 to have two items. That seemed insane. Isn’t it in the contract of HashSet to not contain the same item twice? How could it be? It took a while, but I figured out the issue. Let’s try to recreate a situation where a set contains two identical items.

First, let’s define a Person class.

   1: public class Person : IEquatable<Person>
   2: {
   3:     public Person(string firstName, string lastName)
   4:     {
   5:         FirstName = firstName;
   6:         LastName = lastName;
   7:     }
   8:  
   9:     public string FirstName { get; set; }
  10:     public string LastName { get; set; }
  11:  
  12:     public bool Equals(Person other)
  13:     {
  14:         if (ReferenceEquals(null, other)) return false;
  15:         if (ReferenceEquals(this, other)) return true;
  16:         return Equals(other.FirstName, FirstName) && Equals(other.LastName, LastName);
  17:     }
  18:  
  19:     public override bool Equals(object obj)
  20:     {
  21:         if (ReferenceEquals(null, obj)) return false;
  22:         if (ReferenceEquals(this, obj)) return true;
  23:         if (obj.GetType() != typeof (Person)) return false;
  24:         return Equals((Person) obj);
  25:     }
  26:  
  27:     public override int GetHashCode()
  28:     {
  29:         unchecked
  30:         {
  31:             return ((FirstName != null ? FirstName.GetHashCode() : 0)*397) ^ (LastName != null ? LastName.GetHashCode() : 0);
  32:         }
  33:     }
  34: }

Note how the hash code computation takes both properties into account (this is Resharper’s auto generated code).

And now, let’s pass a test:

   1: [Test]
   2: public void MakeASetWithDuplicateValues()
   3: {
   4:     //The objects start out different
   5:     var p1 = new Person("Doron", "Yaacoby");
   6:     var p2 = new Person("Doron", "Y");
   7:  
   8:     //we put them in sets
   9:     var set1 = new HashSet<Person> { p1};            
  10:     var set2 = new HashSet<Person> { p2 };
  11:  
  12:     //we now make them equal
  13:     p2.LastName = "Yaacoby";
  14:     Assert.That(p1, Is.EqualTo(p2));
  15:     
  16:     //union create a set with duplicate values
  17:     set2.UnionWith(set1);
  18:     Assert.That(set2.Count, Is.EqualTo(2));
  19: }

So what happens here? We change set2’s item, but the set doesn’t know about it. It already computed p2’s hash code and stored it accordingly. It doesn’t know that its hash code has changed later. So when it comes to add p1 to the set, it thinks p1 is a brand new item with a brand new hash code, and lets it in.

Conclusion: do not change objects that are in sets, or even better – don’t create sets of mutable objects.

Posted by dorony | 3 comment(s)
תגים:

Lately I’ve been refactoring some really old code, and it helped me realize that in about 90% of the cases, inheritance from a class (unlike interface implementation) is a Bad Thing. Of course, I’m hardly the first to think that, but it’s not until I had to refactor deep, and absolutely wrong, object graphs until I felt it in my bones.

But why? Isn’t inheritance a legitimate way to reuse code?

Well, no, for several reasons:

  1. It makes the code harder to understand. To understand the flow of your class, you have to go all the way up to the highest base class, and follow the protected method route, which are implemented in various child classes. It’s a torture.
  2. It makes the code harder to test. Are your tests really testing the class itself, or its base class. If you want to test the class itself, and not base class, how do you mock it out?
  3. It makes the code harder to maintain. Adding new functionality can be complicated, as you decide to add it to the base class, or its child classes.

In fact, I suggest that whenever you’re writing a “BaseThingy” class, you should rethink your design, and prefer interface implementation and composition instead. Composition solves all three issues, and in most cases, is a much better choice. In fact, "Composition over Inheritance” is such a well accepted notion, that it even has its own Wikipedia article.

Posted by dorony | with no comments
תגים:

Hopefully you’ve encountered by now the System.Collections.Concurrent namespace. It’s new to .NET 4, and it has many useful data structures which are optimized for multi-threaded usage. ConcurrentBag, for instance, is an unordered collection of objects, which multiple threads can add and remove objects from at the same time.

Recently I needed to test that some piece of code was thread-safe. To do that, I wanted to run it once synchronously, and once asynchronously. It looked something like that:

   1: private static IEnumerable<ComputationResult> ComputeSynchronous(IEnumerable<string> inputs)
   2: {
   3:     Console.WriteLine("Starting synchrnous correction.");            
   4:     var results = new ConcurrentBag<ComputationResult>();
   5:     foreach (var input in inputs)
   6:     {       
   7:         //Some more printouts and work here
   8:         var result = computation.Compute(input);                
   9:         //Some more printouts and work here
  10:         results.Add(result);                
  11:     }
  12:     return results;
  13: }
  14:  
  15: private static IEnumerable<ComputationResult> ComputeAsynchronous(IEnumerable<string> inputs)
  16: {
  17:     Console.WriteLine("Starting asynchrnous correction.");
  18:     var results = new ConcurrentBag<ComputationResult>();
  19:     Parallel.ForEach(inputs, input =>
  20:         {                    
  21:             //Some more printouts and work here
  22:             var result = computation.Compute(input);
  23:             //Some more printouts and work here
  24:             results.Add(result);                    
  25:         });
  26:     return results.ToList();
  27: }

And so, the first method computes stuff in one thread, and the second does the same computation on the same inputs, but inside a parallel loop. This is all nice and simple, but I was a little bothered by the fact the the code inside the loop in the two methods repeats itself. The logical conclusion would be to extract a method, of course. This should look something like this:

   1: private static void ComputeAndAdd(string input, ICollection<ComputationResult> results)
   2: {
   3:     //Some more printouts and work here
   4:     var result = computation.Compute(input);
   5:     //Some more printouts and work here
   6:     results.Add(result);                        
   7: }

Only, you can’t write this method. If you try to pass it a ConcurrentBag<T> it will not compile, since that class doesn’t implement ICollection<T> nor IList<T>. This seemed strange. You can add and remove items from this collection. You can iterate it, why doesn’t it implement a more robust interface (you only get IEnumerable<T> and the non-generic ICollection, which doesn’t have an Add method). Baffled, I went and did the logical thing: asked in StackOverflow.

At first there was a suggestion to use SynchronizedCollection<T>, which does implement the interface I needed. But this is and older class that exists since .NET 2.0, and it isn’t as optimized as the newer ConcurrentBag. Also, it doesn’t really answer the question.

Later, user Rick Sladkey provided a better answer, and I’ll let him do the talking here:

A List<T> is not concurrent and so it can implement ICollection<T> which gives you the pair of methods Contains and Add. If Contains returns false you can safely call Add knowing it will succeed.

A ConcurrentBag<T> is concurrent and so it cannot implement ICollection<T> because the answer Contains returns might be invalid by the time you call Add. Instead it implements IProducerConsumerCollection<T> which provides the single method TryAdd that does the work of both Contains and Add.

So even though it seems That ConcurrentBag<T> implements the contract of ICollection<T>, it really doesn’t. There’s an underlying contract, that isn’t expressed in the interface’s methods, which ConcurrentBag<T> simply can’t implement. Rick then goes on and suggests using delegates to solve the above problem, a decent solution.

Posted by dorony | with no comments

We’ve hit an annoying issue a while ago. We are publishing a WCF web service which is consumed by several clients, and one of them failed to see leading and trailing spaces when reading the message. In order to solve this, without having to change and redistribute the client, we wanted to return some of the strings wrapped by a CData element. WCF doesn’t have a builtin mechanism for this, but we’ve found a suggestion to use the CDataWrapper class to acheive this.

This works, but it also changes the schema. In fact, .NET clients will see the CDataWrapper class as a DataSet in the generated proxy. In order to make a CDataWrapper property appear as a string in the WSDL, we use the XmlSchemaProvider attribute. Here is the entire CDataWrapper class we used:

   1: [XmlSchemaProvider("GetSchema")]
   2: public sealed class CDataWrapper : IXmlSerializable
   3: {
   4:  
   5:     public static XmlQualifiedName GetSchema(XmlSchemaSet xs)
   6:     {           
   7:         return XmlSchemaType.GetBuiltInSimpleType(XmlTypeCode.String).QualifiedName;
   8:     }
   9:  
  10:     // implicit to/from string
  11:     public static implicit operator string(CDataWrapper value)
  12:     {
  13:         return value == null ? null : value.Value;
  14:     }
  15:  
  16:     public static implicit operator CDataWrapper(string value)
  17:     {
  18:         return value == null
  19:                    ? null
  20:                    : new CDataWrapper
  21:                          {
  22:                              Value =
  23:                                  value
  24:                          };
  25:     }
  26:  
  27:     public System.Xml.Schema.XmlSchema GetSchema()
  28:     {            
  29:         return null;
  30:     }
  31:  
  32:     // "" => <Node/>
  33:     // "Foo" => <Node><![CDATA[Foo]]></Node>
  34:     public void WriteXml(XmlWriter writer)
  35:     {
  36:         if (!string.IsNullOrEmpty(Value))
  37:         {
  38:             writer.WriteCData(Value);
  39:         }
  40:     }
  41:  
  42:     // <Node/> => ""
  43:     // <Node></Node> => ""
  44:     // <Node>Foo</Node> => "Foo"
  45:     // <Node><![CDATA[Foo]]></Node> => "Foo"
  46:     public void ReadXml(XmlReader reader)
  47:     {
  48:         if (reader.IsEmptyElement)
  49:         {
  50:             Value = "";
  51:         }
  52:         else
  53:         {
  54:             reader.Read();
  55:             switch (reader.NodeType)
  56:             {
  57:                 case XmlNodeType.EndElement:
  58:                     Value = ""; // empty after all...
  59:                     break;
  60:                 case XmlNodeType.Text:
  61:                 case XmlNodeType.CDATA:
  62:                     Value = reader.ReadContentAsString();
  63:                     break;
  64:                 default:
  65:                     throw new InvalidOperationException("Expected text or CData but was: "+ reader.NodeType);
  66:  
  67:             }
  68:         }
  69:     }
  70:  
  71:     // underlying value
  72:     public string Value { get; set; }
  73:  
  74:     public override string ToString()
  75:     {
  76:         return Value;
  77:     }
  78: }

In order to use this in an object you want to send over the wire, you will write something like this (again, taken entirely from here):

   1: // example usage
   2: [DataContract(Namespace="http://myobjects/")]
   3: public sealed class MyType
   4: {
   5:     public string SomeValue { get; set; }
   6:     [DataMember(Name = "SomeValue", EmitDefaultValue = false)]
   7:     private CDataWrapper SomeValueCData
   8:     { 
   9:       get { return SomeValue; }
  10:       set { SomeValue = value; }
  11:      }
  12: }

And that’s all there is to it.

Posted by dorony | with no comments

Dave Bouwman noticed that ESRI has annouced that the Web ADF will be deprecated after ArcGIS 10. To me that’s not very surprising. I’ve blogged about my dislike to this framework more than once, and to me its death cannot come any sooner. I guess ESRI noticed that the simplicity of the REST API’s is winning the customers, and decided to back off the wretched thing. So is this good news? Ultimately, yes. The less people trying to make software with the ADF, the better.

But it is a mistake that ESRI should have noticed sooner. To think about the amount of changes the ArcGIS Server APIs went through since version 9.1 makes my head hurts. Upgraded from 9.1 to 9.2? Poof! The framework completely changed and WebADF with all new controls was born. 9.2 to 9.3? Poof! The Ajax framework got replaced and most of your Javascript code can go to the recycle bin. Wanted to change to the new and much better Javascript API? Better get on the dojo train and rewrite everything yet again.

If you were to try to take advantage of the greatest new features of each release, you probably had to rewrite your application every time. ESRI’s idea of backward compatability tends to be “let’s hope our customers won’t notice that we changed everything yet again”.

Posted by dorony | 2 comment(s)
תגים:
More Posts Next page »