A while ago we wanted to choose a new natural language parser for our product. One of the strongest candidates in that area is the Charniak-Johnson parser. We ended up not using it, for various reasons, but evaluating it produced a nice side benefit: we compiled it (well, the Charniak part of it) and ran it natively on Windows.
Now, before I tell you a bit about the conversion process, here are a few important notes:
- This is not the full CJ parser. This is the first-stage Charniak parser only (Aug06 version). The re-ranking second stage was not converted.
- We converted only the non-multithreaded version of the main loop (oparseit.c and not parseit.c). Still, the parsing code should support calls from multiple threads, should you choose to make them (up to the MAXNUMTHREADS parameter, which defaults to 4).
- This is a completely native version of the parser; it doesn’t need cygwin/mingw32 or anything of the sort.
- It does need, however, the Visual C++ 2012 Redistributables to run.
- You need Visual Studio 2012 to open the project (but you can probably easily port it to older VS versions).
- We are not actually using this in our product, anywhere. We thought about using it, but we’re not. This is provided ‘as is’ as a service to the community, and the original license of the parser remains in place.
So now that the formalities are out of the way, here are some details about the conversion process. First, I should note that I am no C++ expert. C# is my thing, and while I did some C/C++ projects at university, that was quite a while ago. It was therefore not an easy task, and it took me a few days to get the entire thing to work. Here are some things I had to do (probably not all of them; this was a long time ago):
- Convert all the .c files to .cpp files so Visual Studio will be happy with the code using C++ constructs.
- Fix header issues of all sorts (like adding #include "stdafx.h" everywhere).
- Add Unicode support by giving all the constant strings the L prefix, using wchar_t everywhere in the code, using wistream and friends instead of istream and friends, etc.
- Figure out which code and files are not needed for pure parsing and get rid of them. The fewer files I have to touch, the better. For instance, I got rid of all the EvalTree stuff in the original parser (a separate command that evaluates parse trees, which I didn’t need).
The biggest headache, though, was the performance issue. For some reason, the parser performed much worse on Windows than on Linux, and it was pretty hard to understand why. I ended up using the Visual Studio C++ profiler, and traced the issue to a line that allocates a large array of objects. Digging a little further, I noticed that each allocated object contains a large std::list, so in fact, I was allocating lots of lists. The code basically boils down to this:
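A minimal sketch of the pattern described, not the parser's actual code; `Cell`, `items`, and `makeCells` are hypothetical names chosen for illustration:

```cpp
#include <cstddef>
#include <list>
#include <memory>

// Hypothetical stand-in for a parser object that embeds a std::list.
struct Cell {
    std::list<int> items;
};

// Allocates `count` cells in one go. Every element's std::list is
// default-constructed here, whether or not it is ever used.
std::unique_ptr<Cell[]> makeCells(std::size_t count) {
    return std::unique_ptr<Cell[]>(new Cell[count]);
}
```

With a large `count`, the total cost is dominated by however expensive a single `std::list` default constructor is in the standard library being used.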
I measured the code above to be 10 times slower on Windows than on Linux. Funky, eh? At first I thought that memory allocation on Windows was slower than on Linux. Who knows, right? Well, the folks at StackOverflow helped me realize that memory allocation has nothing to do with this; it is actually the std::list constructor that is slower in the Microsoft implementation, probably because it allocates a bigger initial buffer. Once I knew this, I was able to tweak the code to allocate the std::lists dynamically, so I ended up creating a lot fewer of them. This solved the performance issue, and in fact I measured the parser to be even faster on Windows than on Linux now.
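One way to allocate the lists dynamically is to hold each list behind a pointer and create it only on first use. The sketch below assumes that approach; it is not the actual change made to the parser, and `LazyCell`, `items()`, and `hasItems()` are hypothetical names.

```cpp
#include <list>
#include <memory>

// Hypothetical stand-in for a parser object whose list is created lazily.
class LazyCell {
public:
    // Returns the list, constructing it on first access only.
    std::list<int>& items() {
        if (!items_) {
            items_.reset(new std::list<int>());
        }
        return *items_;
    }

    // True once the list has been created.
    bool hasItems() const { return items_ != nullptr; }

private:
    std::unique_ptr<std::list<int>> items_;  // empty until first use
};
```

Allocating a large array of `LazyCell` objects then only default-constructs empty pointers, and a `std::list` is built just for the cells that are actually touched.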
That’s it for now, I guess. If you have any questions about using the binaries or the code posted above, please contact me.