Naïve suffix array search engine vs. Ferret vs. Lucene
Motivated by
this performance comparison of Montezuma vs. Ferret and Lucene,
I tried to see how my toyish full-text search engine
based on suffix arrays fares against Ferret:

If the relative performance between Ferret and Lucene shown in the Montezuma benchmark holds, my toy engine is about 2x faster than Lucene when indexing small corpora.
Here are the scripts I used (taken from Ferret's wiki):
Searching
As expected, searching is quite slow. With a hot cache, this is what I get when searching half of linux-2.6.12's sources:
$ ruby lookup.rb test_FULLTEXT test_INDEX Input term: sprintf Needed 0.002611 seconds for the first match. Needed 1.71507 to collect 5441 hits (in 1118 docs).
I don't think there's much I can do to get below those 2.6ms for the first match, but collecting all the hits is way too slow, taking 0.3 ms per occurrence. This was fully expected because of the way the document URI for a match is obtained: it's essentially stored past the document data, so (as much as) the whole document is read before its URI is found. This was OK with the small RDoc corpus (with small articles) used by FastRI, but it's way too slow when there are many hits and documents are longer*1 Moreover, the document metadata is being loaded once per hit.
Also, when the cache is cold the first match takes longer --- maybe as much as
1/10th of a second. This is due to the
non-sequential accesses to the text during the binary search. Since suffix
array algorithms implicitly assume that the suffix array will be in main
memory, I can make the structure a bit richer to avoid at least some disk
seeks. There are some
(m == pattern length) searching
algorithms, but they'll certainly involve a high constant overhead in Ruby, so
for the corpora sizes I'm aiming at (things in the GB range) a simple revision
of the basic
algorithm will perform
better (in less code).
*1 Consistently with that hypothesis, the per-hit times are smaller for FastRI's shorter documents.
- 25 http://www.artima.com/forums/flat.jsp?forum=123&thread=185991
- 23 http://anarchaia.org
- 20 http://rubyblogs.org/aggy/tag/ferret
- 17 http://planetruby.0x42.net
- 16 http://rubyblogs.org/aggy/tag/array
- 11 http://hoscilo.pypla.net/2007/06/23/full-text-search-w-railsach-ferret
- 9 http://www.anarchaia.org
- 8 http://www.artima.com/buzz/community.jsp?forum=123
- 7 http://www.planetrubyonrails.com
- 7 http://hoscilo.pypla.net
Keyword(s):[blog] [ruby] [frontpage] [subpar] [suffix] [array] [ferret] [lucene] [search] [engine] [benchmark]
References:[Indexing faster than Ferret with some algorithmic help (an order of magnitude faster than Lucene? can't be)]