At Ginger we use a large index of n-grams: sequences of words, each paired with its frequency in our corpus. We wanted to make this index searchable, so naturally we defaulted to Lucene, the most popular open source IR library. This is how we started adding documents to the index:
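In sketch form, assuming the Lucene 3.x API (a `NumericField` for the frequency is one way to do it; the variable names are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;

// One document per n-gram: the text itself plus its corpus frequency.
Document doc = new Document();
doc.add(new Field("ngram", ngram,
        Field.Store.YES, Field.Index.ANALYZED));
doc.add(new NumericField("frequency", Field.Store.YES, true)
        .setLongValue(frequency));
indexWriter.addDocument(doc);
```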
As you can see, we are indexing and storing both the ngram and the frequency fields, where one is a string and the other is a long.
After the indexing was done, we wanted to search the index. Our main use case for searching is this: take a bunch of words, and return the most frequent n-grams that contain them. My natural inclination was to use sorting, via IndexSearcher.search(Query query, int n, Sort sort), which lets us supply a Sort object that sorts by the frequency field in descending order.
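For illustration, the sorted search looked something like this (again assuming Lucene 3.x; `SortField.LONG` matches the long frequency field, and the final `true` requests descending order):

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Sort hits by the stored frequency, most frequent first.
Sort byFrequency = new Sort(new SortField("frequency", SortField.LONG, true));
TopDocs hits = searcher.search(query, 10, byFrequency);
```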
This turned out to be a bad idea. Lucene handles sorting using the FieldCache, which means it tried to load the frequencies of all the n-grams in the index into an array. That is a lot of data to load, and an OutOfMemoryError was thrown. But then I realized that I always want results ordered by frequency. I don't need different result sets sorted by different fields, so this information should live in the index itself and shouldn't require a cache.
Therefore, I arrived at the concept of Scoring. Scoring in Lucene determines how results are ordered by default, which is exactly what I needed. The default algorithm tries to return the results most relevant to a given query, and I wanted to spice it up with my frequency information. The way to do that in Lucene is document-level boosting, which lets you say that a specific document is more relevant than others. At first I did the naive thing and set the boost to the frequency, like this:
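Roughly (Document.setBoost takes a float, so the long frequency is widened implicitly):

```java
// Naive approach: use the raw frequency directly as the document-level boost.
doc.setBoost(frequency);
```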
It even seemed to work for a while, but then I noticed a couple of issues:
- The boost factor was too dominant: Lucene sometimes preferred n-grams containing fewer of the query words but with a greater frequency over n-grams containing more of the query words but with a lower frequency. For instance, a search for “pleasure meet you” would prefer “I want to meet you” (frequency 100000) over “pleasure to meet you” (frequency 10000). This made me revise my perspective: I don’t always want results ordered purely by frequency; the number of query words contained is an important factor as well.
- Sometimes results weren’t exactly ordered by frequency. An n-gram with a frequency of 1000 seemed to have exactly the same score as an n-gram with a frequency of 10000, even though the latter was given a higher boost factor.
While the first issue was rather obvious, the second one had me baffled for a while. It turns out that even though the boost value is a float, it is stored as a single byte: the document norm. This value combines the document boost, the field boost (individual fields can be boosted as well, though that’s not useful in my case) and a length factor, which basically says that shorter documents are better (again irrelevant for me, since all the n-grams in a single index have the same length). So basically, we took a long (the frequency), which has a huge range of values, and stuffed it into a byte, which only allows for 256 distinct values. Obviously, I’m going to lose a bit of precision here.
The solution to both issues was to compute the boost in a different way. Instead of setting it to the frequency, we will set it to some function of the frequency. This should achieve:
- Smaller values for the boost factor, so it doesn’t override the coordination factor (which gives documents containing more of the query terms a higher score).
- A better ordering of n-grams by frequency. An n-gram with a frequency of 2000 should get a higher score than one with a frequency of 1000, but I don’t care if an n-gram with a frequency of 110000 scores the same as one with 100000. In essence, I can afford to lose precision at the high end of the frequency range, but much less so at the low end.
The function we came up with looks something like this:
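In sketch form (the method name is illustrative, and the exact constants were tuned for our corpus; the log base and the 1–255 clamp follow from the description below):

```java
public class BoostFunction {

    /** Maps a raw corpus frequency to a boost in the range [1, 255]. */
    public static float frequencyToBoost(long frequency) {
        // A log with base 1.5 compresses the range: nearby low frequencies
        // still map to distinct boosts, while high frequencies bunch up.
        double boost = Math.log(frequency) / Math.log(1.5);
        // Clamp: never below 1, and everything above the threshold gets 255.
        return (float) Math.max(1.0, Math.min(255.0, boost));
    }
}
```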
Basically, we want the boost to be a number between 1 and 255 (which achieves our first goal of smaller boost values). Taking the log of the frequency (base 1.5) not only gives smaller values, but also achieves our second goal of better precision for the lower frequencies. Above a certain threshold, all n-grams get the same boost factor of 255.
At this point I thought I was done, but I was still seeing precision loss. Lucene’s default way of encoding a float to a byte (and back) didn’t suit me; it tries to represent a large range of values, when I only need 1-255. Therefore, I had to create my own Similarity class to override this behaviour.
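A sketch of that class, assuming the Lucene 3.x Similarity API (the class name is ours):

```java
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.DefaultSimilarity;

public class FrequencySimilarity extends DefaultSimilarity {

    // Drop the length factor: the norm is the document boost, nothing else.
    @Override
    public float computeNorm(String field, FieldInvertState state) {
        return state.getBoost();
    }

    // The boost is always in [1, 255], so a plain cast loses nothing.
    @Override
    public byte encodeNormValue(float f) {
        return (byte) f;
    }

    // Java's byte is signed; mask with 0xFF to recover the 1-255 range.
    @Override
    public float decodeNormValue(byte b) {
        return b & 0xFF;
    }
}
```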
First, since the length factor is irrelevant to us, we override computeNorm to ensure that norm = boost, and nothing else. Second, as you can see, the encoding method is quite simple: it just casts the float to a byte. We know for sure that the boost is in the range 1-255 and that norm = boost, so we can safely cast. The decoding method is a little weirder; apparently Java’s byte is signed (unlike C#’s byte, which is unsigned), so a simple cast from byte back to float would have given us a negative number, hence the 0xFF mask.
We now just have to apply our Similarity class to IndexWriter, like this:
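In sketch form (FrequencySimilarity stands for the custom Similarity class described above, whatever you name it):

```java
// Make the writer use our similarity when computing norms at index time.
indexWriter.setSimilarity(new FrequencySimilarity());
```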
It is also important to remember to use the custom Similarity class when searching (that’s when decodeNormValue is called):
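Again in sketch form, with the same hypothetical class name:

```java
// The searcher must decode norms the same way they were encoded.
searcher.setSimilarity(new FrequencySimilarity());
```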
And now we’re done.
My only niggle with Lucene is the documentation. It has excellent documentation, but only in the form of a book (which I read and enjoyed), and a PDF book is a lot less cosy to work with than something that is truly online (like blog posts): you can’t google and get results from Lucene In Action. Other than that, Lucene has been a perfect fit for our purpose. It is easily customizable, so we could tune it to our specific needs, and it gives excellent query performance. Open source is awesome, isn’t it?