
A simple full-text search engine in 200 lines of Ruby

Update: FTSearch source code available: indexes 3 times faster than Ferret

I'd been planning to add full-text search capabilities to FastRI from the beginning, and in Ruby-land "full-text" means Ferret. But I wanted to keep dependencies to a minimum, as FastRI could someday aspire to replace parts of the agonizing ri in the stdlib. There had to be a middle ground between risearch's simplicity and slowness and Ferret's nuclear-powered nutcracker.

I decided to write a straightforward full-text engine using suffix arrays, and the very first (utterly naïve) implementation was already fast enough for FastRI: it took ~10 seconds to index some 40 gems plus the stdlib documentation, and I was getting sub-millisecond query times. In a later test, I indexed 20% of Linux's sources (why only 20%? because indexing was done in memory), which took half a minute, and queries were in the millisecond range. Not bad at all for a couple hundred lines of easy Ruby code.

Working principle

Suffix arrays

One way to find all occurrences of a word, "foo", in a document is to find all the character sequences that start with "foo", i.e. all the suffixes that start with "foo". How is that any faster than just searching linearly for "foo" in the first place? The trick is creating a lexicographically sorted suffix array. For instance, given the string "this is a document", you have all these suffixes:

str = "this is a document"
puts (0...str.size).map{|i| str[i..-1].inspect}.sort
# >> " a document"
# >> " document"
# >> " is a document"
# >> "a document"
# >> "cument"
# >> "document"
# >> "ent"
# >> "his is a document"
# >> "is a document"
# >> "is is a document"
# >> "ment"
# >> "nt"
# >> "ocument"
# >> "s a document"
# >> "s is a document"
# >> "t"
# >> "this is a document"
# >> "ument"

You can tell if a word is in that string in logarithmic time using binary search, by looking for a suffix starting with it in the suffix array. Of course, you'd normally represent suffixes in the suffix array as offsets in the string:

p (0...str.size).sort_by{|i| str[i..-1]}
# >> [7, 9, 4, 8, 12, 10, 15, 1, 5, 2, 14, 16, 11, 6, 3, 17, 0, 13]
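
Just to illustrate the idea (this snippet is not part of the engine, and find_in is a made-up name), a binary search over that in-memory offsets array could look like this:

# Illustration only: binary search for a term over the sorted offsets built
# above. Returns the offset of one suffix starting with term, or nil.
def find_in(str, offsets, term)
  lo, hi = 0, offsets.size - 1
  while lo < hi
    mid = (lo + hi) / 2
    if term <= str[offsets[mid], term.size]
      hi = mid
    else
      lo = mid + 1
    end
  end
  offsets[lo] if str[offsets[lo], term.size] == term
end

sa = (0...str.size).sort_by{|i| str[i..-1]}
p find_in(str, sa, "doc")   # => 10, the offset of "document"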

Extending to multiple documents

Suffix arrays allow us to find all occurrences of a word in a document quickly, but how can this be extended to multiple documents? All you have to do is combine them into a larger one (which I'll call the "full text"), so you can still use binary search over the suffix array, and find a way to determine which document a given offset within the full text belongs to.

In my proof of concept, I just wrote the document name after its contents, using '\0' to delimit it:

  def add_document(name, contents)
    @fulltext << preprocess(contents)
    @fulltext << "\0#{name}\0"
  end

This way, once you get the offset of the suffix you're after, all you have to do is keep reading from the full text until you find the document name. This makes the search

O(M log N + P)

where M is the length of the query term, N the number of suffixes in the index, and P the size of the matching document. In practice, the documents I'm going to index (method/class/module descriptions) are short, so the P term doesn't matter. Also, since the document name is placed after its contents, I can just read forward sequentially, which is much faster than having to seek backwards if the name were placed before; indeed, the next few blocks' worth of data in the full text file will probably have been read & cached by the FS anyway.
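
A minimal sketch of that forward scan (the actual code uses a method called find_path, shown in the lookup code below, which may differ in detail); it assumes an IO opened on the full text file and positioned at the hit's offset:

# Sketch only: scan forward from the current position until the "\0name\0"
# marker that follows the document's contents, and return the name.
def read_document_name(fulltextIO)
  return nil unless fulltextIO.gets("\0")    # skip the rest of the document
  name = fulltextIO.gets("\0")               # the name, terminated by another \0
  name && name.chomp("\0")
end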

Limiting the number of suffixes

If I proceeded as shown in the above example, given N bytes of full text data, there would also be N suffixes. If the suffix array were encoded using 32-bit integers to represent offsets, it would take 4 times as many bytes as the full text. Also, since sorting with Ruby's Array#sort (a quicksort) is O(N log N), having too many suffixes makes indexing slower (and also searches, to a lesser degree).

Besides, not all suffixes are useful; consider the following ones from the first example: "nt", "cument", "ment", "ocument", "ument"... Clearly suffixes starting where a word begins are much more interesting.

Implementing it

Indexing

Here's the code that finds the suffixes, sorts them and writes the 32-bit offsets to a file:

File.open(@fulltext_file, "w"){|f| f.puts @fulltext }
scanner = StringScanner.new(@fulltext)

suffixes = []
until scanner.eos?
  start = scanner.pos
  # grab everything up to and including the next "\0name\0" document marker
  text = scanner.scan_until(/\0.*?\0/)
  unless text
    scanner.terminate
    break
  end
  text = text.sub(/\0.*?\0$/, "")   # strip the marker, keep the document body
  suffixes.concat find_suffixes(text, start)
end
puts "Suffixes: #{suffixes.size}"
t0 = Time.new
sorted = suffixes.sort_by{|x| @fulltext[x,MAXWORD_SIZE]}
File.open(@index_file, "w") do |f|
  sorted.each_slice(10000){|x| f.write x.pack("V*")}
end
#File.open("suffixes", "w"){|f| sorted.each{|i| f.puts @fulltext[i,MAXWORD_SIZE].inspect}}
puts "Processed in #{Time.new - t0} seconds"

It works by scanning the @fulltext, removing the document markers, finding the suffixes within a document (whose offsets are corrected using the displacement of the document in @fulltext) and sorting them before writing to disk.

A note about sorting

sort_by{|x| @fulltext[x,MAXWORD_SIZE]}

is but an approximation: we'd normally want to sort according to

@fulltext[x..end_of_doc_offset]

but

  1. determining end_of_doc_offset would be too slow
  2. this way, the amount of memory used while sorting stays proportional to MAXWORD_SIZE instead of to the document sizes

This limitation implies that suffixes with more than MAXWORD_SIZE characters in common at the beginning might not be sorted correctly, which means we could miss some results when the query term is longer than MAXWORD_SIZE. However, few (if any) words should be longer than 30 or 40 characters, so that's not much of a problem.

Finding suffixes within a document

As explained above, I'm only interested in suffixes marking word boundaries. These can be found using a couple of regexps, representing words and things that cannot possibly belong to a word. A StringScanner does nicely:

require 'strscan'
def find_suffixes(string, offset)
  suffixes = []
  sc = StringScanner.new(string)
  until sc.eos?
    sc.skip(/([^A-Za-z_]|\n)*/)        # skip anything that cannot start a word
    len = string.size
    loop do
      break if sc.pos == len
      suffixes << offset + sc.pos      # record a word-start suffix
      # jump past the current word and any trailing non-word characters
      break unless sc.skip(/[A-Za-z0-9_]+([^A-Za-z0-9_]|\n)*/)
    end
  end
  suffixes
end
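
For instance, on the string from the first example it keeps only the four word-start suffixes:

p find_suffixes("this is a document", 0)   # => [0, 5, 8, 10]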

Searching

The basis of full-text searching using suffix arrays is the binary search:

def binary_search(indexIO, fulltextIO, term, from, to)
  middle = (from + to) / 2
  pivot = get_string(indexIO, fulltextIO, middle, MAXWORD_SIZE)
  if from == to
    if pivot.index(term) == 0
      indexIO.pos = middle * 4
      [middle, indexIO.read(4).unpack("V")[0]]
    else
      nil
    end
  elsif term <= pivot
    binary_search(indexIO, fulltextIO, term, from, middle)
  elsif term > pivot
    binary_search(indexIO, fulltextIO, term, middle+1, to)
  end
end

get_string reads the 32-bit offset stored at the given suffix-array position in the index file, then seeks to that offset within the full text and reads from there.
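
Here's a plausible reconstruction of it (the actual implementation is in lookup.rb, linked below):

# Sketch of get_string as described above; the real one lives in lookup.rb.
def get_string(indexIO, fulltextIO, index, size)
  indexIO.pos = index * 4                    # each suffix-array entry is 4 bytes
  offset = indexIO.read(4).unpack("V")[0]    # 32-bit little-endian offset
  fulltextIO.pos = offset
  fulltextIO.read(size).to_s
end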

Once you have the binary_search method, #lookup, which takes a query term and returns the first hit, becomes trivial:

def lookup(term)
  File.open(@fulltext_file, "rb") do |fulltextIO|
    File.open(@index_file, "rb") do |indexIO|
      index, offset = binary_search(indexIO, fulltextIO, term, 0, indexIO.stat.size / 4 - 1)
      if offset
        fulltextIO.pos = offset
        if path = find_path(fulltextIO)
          Result.new(self, term, index, path)
        else
          nil
        end
      else
        nil
      end
    end
  end
end

Result objects can return the text around the hit, making sure the document markers are removed.

Once you have the first hit, all other occurrences of the query term follow it in the suffix array, so you can get them in no time. You could also skip the first N hits simply by moving forward in the suffix array (this is not as useful as it sounds, though, once you start to compute a measure of relevance).
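
A minimal sketch of that forward walk (not FastRI's actual API; it reuses get_string and the 4-byte index entries from above):

# Sketch only: starting from the suffix-array index of the first hit, yield
# the full-text offset of each occurrence while the suffixes still start
# with the query term.
def each_hit_offset(indexIO, fulltextIO, term, first_index)
  entries = indexIO.stat.size / 4
  index = first_index
  while index < entries && get_string(indexIO, fulltextIO, index, MAXWORD_SIZE).index(term) == 0
    indexIO.pos = index * 4
    yield indexIO.read(4).unpack("V")[0]
    index += 1
  end
end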

Extensions

This full-text search engine only allows you to search for a single term. But more complex (boolean) queries can be implemented easily on top of it. Also, metadata can be stored along with the "document markers" without changing the fundamental behavior. I implemented both in FastRI in little time.
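
For example, an AND query could be layered on top of the single-term lookup roughly like this (all_hits and document_name are hypothetical helpers standing for "all occurrences of a term" and "the document an offset belongs to"; FastRI's actual code differs):

# Hypothetical sketch of a boolean AND on top of single-term search: a
# document matches both terms if each has at least one hit inside it.
def and_query(index, term_a, term_b)
  docs_a = index.all_hits(term_a).map{|hit| hit.document_name}
  docs_b = index.all_hits(term_b).map{|hit| hit.document_name}
  docs_a & docs_b    # Array#& keeps only the documents containing both terms
end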

It could also be extended to support fairly fast searches using regular expressions, as long as there's enough text at the beginning of the regexp to "anchor" the matches, allowing binary search to kick in.

Comparing to other search engines

At the end of the day, we have this tiny (basic) full-text search engine in ~200 LoC. It's not nearly as fast as it could get (naïve IO, pure Ruby...), but it doesn't score too badly against engines based on an inverted index like Ferret or Lucene:

  • it can do without advanced stemming, so looking for "program" will find "program", "programmer", "programming"...
  • building the index is quite fast, and it can be made to work when the full text doesn't fit in memory by using e.g. merge sort to combine sorted sub-arrays (see the sketch after this list)
  • it can be trivially extended to add document metadata, relevance measures, etc.
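
Regarding the merge-sort point above, here's a rough sketch of the merge step, assuming each run of suffix offsets has already been sorted by its MAXWORD_SIZE-character key; splitting the suffixes into runs and streaming them from disk is left out:

# Sketch only: merge two runs of suffix offsets, each already sorted by its
# MAXWORD_SIZE-character key, into a single sorted run.
def merge_runs(fulltext, run_a, run_b)
  merged = []
  a = b = 0
  while a < run_a.size && b < run_b.size
    if fulltext[run_a[a], MAXWORD_SIZE] <= fulltext[run_b[b], MAXWORD_SIZE]
      merged << run_a[a]; a += 1
    else
      merged << run_b[b]; b += 1
    end
  end
  merged.concat(run_a[a..-1]).concat(run_b[b..-1])
end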

The code

Look at indexer.rb and lookup.rb for the proof of concept that evolved into FastRI's full-text search engine. The actual code used by FastRI is under lib/fastri/.



My Brain Expanded - JEG2 (2006-11-16 (Thr) 06:57:51)

You have such a gift for taking complex computer science topics and presenting them so wimps like me can understand them. I learn more reading this blog than I did in years of school.

hgs 2006-11-16 (Thr) 10:53:23

Impressive stuff.

If you use something like

Struct.new("DocIndex", :filename, :substring)

then you wouldn't need to seek/search to the end to find the filename. I wonder if that would help performance at all?

mfp 2006-11-16 (Thr) 12:22:07

You have such a gift for taking complex computer science topics and presenting them so wimps like me can understand them.

I never thought I was a good writer/communicator; my style is bland and my lexicon poor --- I could blame my English, being my 3rd language, but I'm not sure I'm vastly better in any other (maybe my style or lack thereof accompanies me no matter the language?). But I'm glad you liked it anyway :) (BTW. in part eigenclass.org's point is that these are not complex topics, it's just that we're used to working/playing with even easier stuff.)

If you use something like Struct.new("DocIndex", :filename, :substring) then you wouldn't need to seek/search to the end to find the filename. I wonder if that would help performance at all?

hmm I don't see what you mean. Struct.new would be an in-memory structure, but the index is in a file. If document/metadata retrieval became a problem due to indexed documents being too large, I could use another file consisting of

offset docname pointer-to-metadata

records, and the document name could be found with binary search again (there would be only one record per document, so this could be kept in memory easily). Metadata could be stored in a separate file, or right after the data in the "full text file".

The "documents" FastRI has to deal with are typically under a hundred bytes long, so this isn't needed, but I might implement it nonetheless, only because hacking this is fun :)

hgs 2006-11-16 (Thr) 13:37:22

Oh, I forgot it wasn't in memory at the time you were accessing it. Never mind. :-)


Ruby quiz #54 - Alex (2006-11-16 (Thr) 09:38:36)

See also Ruby quiz 54 and Dave Balmain's comparison of entries (I can't link directly to the post, the permalink is broken.).

mfp 2006-11-16 (Thr) 12:29:45

Thank you, I'd forgotten that one. It's becoming easier as time goes on, with >100(!) quizzes...

I took a look at the solutions, and they all use an inverted index of some kind (either with a hash/trie or a bitmap with one bit per document). I might try to see how my suffix-array stuff ranks against them later (esp. if I keep developing it).


Unicode - roberthahn (2006-11-17 (Fri) 08:06:37)

I would totally want to use this, but I'm trying to figure out a means to set up a character equivalency. People might want to search for "allo" (French for hello), and expect to get "âllo" in the search results.

So far, the best idea I can think of is to construct a giant hash, where the keys are stringified Regexps, and the value is the ascii equivalent: { "[äâáåà]" => "a" }, and then iterate through the hash on the source document, converting unicode characters to ascii.

It also seems like you'd have to treat all punctuation as insignificant somehow -- but then the full text repository isn't true to the original presentation, so the best you could do is return links to the file and let the user figure it out.

But then, you did say it was a *simple* full-text search engine... :)

mfp 2006-11-17 (Fri) 12:23:43

You can easily handle such "character equivalences" by changing the comparison function in the binary search method (e.g. by creating an approximate ASCII representation of the suffix before comparing), without changing the corpus itself.
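
Something along these lines (a deliberately tiny mapping, just to show the shape; a real table would cover many more characters, and on 1.8 the regexps need to be UTF-8 aware, e.g. with $KCODE set to 'u'):

# Sketch only: build a rough ASCII form of a string and compare those forms
# in binary_search instead of the raw strings.
EQUIV = [[/[äâáåà]/, "a"], [/[ëêéè]/, "e"], [/[ïîíì]/, "i"]]
def to_ascii(str)
  EQUIV.inject(str){|s, (re, rep)| s.gsub(re, rep)}
end
# ...then compare to_ascii(term) against to_ascii(pivot) in binary_search.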

As for punctuation, it's not being ignored:

Input term: 
foo.to_s
Needed 0.000856 seconds.
================================================================================
Found in /home/batsman/usr//share/ri/1.8/system/WeakRef/cdesc-WeakRef.yaml:
 
Usage:
  foo = Object.new
  foo = Object.new
  p foo.to_s                  # original's class
  foo

It's just that, by construction, there are no suffixes starting with a non-word character, so you can look for "bar * baz" but not for "-foo". It is possible to change the criteria used for "suffix selection" on a document basis (or even in different sections of a document) very easily. For instance, you could use n-gram analysis to locate word boundaries (just the start suffices) in a text written in Japanese, and arbitrarily sophisticated lexing to index source code.

This makes for some interesting hacking, so I'm going to see how far I can push it in terms of speed (avoiding disk seeks by enriching the suffix array with additional data, caching, general optimization), scalability (splitting the index into several fragments, supporting merging, document addition), functionality (boolean & regexp searches, relevance criteria, maybe principal component analysis?) and reliability (concurrent search/indexing, transaction support?), while keeping it pure Ruby. At that point it would justify a separate release.


Binary_search - M. Demazure (2006-11-18 (Sat) 02:10:56)

Simpler code for binary_search, avoiding unnecessary recursive calls.

def binary_search(indexIO, fulltextIO, term, from, to)
  while from < to
    middle = (from + to) / 2
    pivot = get_string(indexIO, fulltextIO, middle, MAXWORD_SIZE)
    if term <= pivot then to = middle
    elsif term > pivot then from = middle + 1
    end
  end
  pivot = get_string(indexIO, fulltextIO, from, MAXWORD_SIZE)
  if pivot.index(term) == 0
    indexIO.pos = from * 4
    [from, indexIO.read(4).unpack("V")[0]]
  else
    nil
  end
end

mfp 2006-11-18 (Sat) 16:51:51

better indeed since Ruby doesn't do tail-call optimization

Anonyme 2006-11-29 (Wed) 00:46:42

Yes. By the way, I used your prototype code for a small personal tool, and it was very useful. Thanks

M.D. 2006-11-29 (Wed) 00:47:37

Last anonymous is me, forgot to put my name.