
Doing some n-gram analysis over Ruby's docs

The first attempts to optimize my pure-Ruby, 200LoC full-text search engine based on suffix arrays (which evolved into the in-progress FTSearch) led me to perform some n-gram analysis over Ruby's core/stdlib docs (as well as a pack of gems).

With a suffix array, locating a word in the corpus involves a binary search of O(log N) complexity, N being the total number of suffixes. Since the suffix array (SA) is but a large array of integer offsets into the "full text" file (each marking the start of a suffix), you need to read the text at the position indicated in the SA in order to compare it to the term you're looking for.

Therefore a corpus like the Linux sources (with some 20 million suffixes) would require about 25 disk seeks per lookup even if the suffix array itself were held in memory. At 10ms per seek, that's 250ms when the file isn't cached...
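The arithmetic above can be sketched as a tiny cost model (lookup_cost is a hypothetical helper, not part of the engine):

```ruby
# Rough cost model for a suffix-array lookup: a binary search over
# nsuffixes entries needs about log2(nsuffixes) probes into the text,
# each potentially a disk seek.
def lookup_cost(nsuffixes, seek_ms = 10)
  seeks = Math.log2(nsuffixes).ceil
  [seeks, seeks * seek_ms]
end

seeks, ms = lookup_cost(20_000_000)   # a Linux-sources-sized corpus
puts "#{seeks} seeks, ~#{ms} ms"      # 25 seeks, ~250 ms
```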

The first idea that came to mind to minimize the number of seeks was storing the first n bytes of text right after the offset in the SA:

 ...
 offset(N)    abcdefgh
 offset(N+1)  abcdefgh      (first 8 bytes coincide)
 offset(N+2)  abcdefgi
 ...

Ideally, if suffixes covered the whole "character space" uniformly, n inlined bytes would save up to 8 * n disk seeks. So n = 4, which would double the size of the SA (offsets are 32-bit integers), would be enough to find any word in a corpus under 4GB without a single disk seek when the SA is held in memory.
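A minimal sketch of what such an SA entry could look like (the record layout and helpers are hypothetical, just to illustrate the idea):

```ruby
# Each entry stores the 32-bit offset followed by the first INLINE_BYTES
# bytes of the suffix, so the binary search can often compare against the
# inlined prefix without seeking into the full-text file.
INLINE_BYTES = 4

def pack_entry(offset, full_text)
  prefix = full_text[offset, INLINE_BYTES].to_s.ljust(INLINE_BYTES, "\0")
  [offset].pack("V") + prefix          # "V" = little-endian 32-bit uint
end

def unpack_entry(record)
  offset = record[0, 4].unpack("V")[0]
  [offset, record[4, INLINE_BYTES]]
end

p unpack_entry(pack_entry(6, "hello world"))   # => [6, "worl"]
```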

Of course, the distribution of suffixes is hardly uniform. We tend to repeat a limited number of words instead of writing random byte strings... I did some basic n-gram analysis over the documentation from the core/stdlib and some ~80 gems; character-wise to begin with (since that's what matters for the full-text search engine), and then word-based while I was at it.

This analysis proved that such "inline suffixes" add a large space overhead to the index with relatively little gains, so I adopted a slightly different solution.

Counting n-grams

If you have FastRI and have used it to build a full-text index of your core/stdlib/gem docs, your ~/.fastri-fulltext/ directory will hold a "suffixes.dat" file (just a large array of 32 bit ints) and a "full_text.dat" file which holds the actual indexed documentation.

Collecting the n-grams is easy:

MAX_SUFFIX_SIZE = 64
NGRAM_SIZES = [4, 8, 10, 12]

suffixes = []

# Read each 32-bit offset from the suffix array, then fetch the
# corresponding suffix (up to MAX_SUFFIX_SIZE bytes, NUL-terminated)
# from the full-text file.
File.open("full_text.dat", "rb") do |ft|
  File.open("suffixes.dat", "rb") do |f|
    until f.eof?
      offset = f.read(4).unpack("V")[0]   # "V" = little-endian 32-bit uint
      ft.pos = offset
      suffix = ft.read(MAX_SUFFIX_SIZE).split(/\0/)[0]  # truncate at first NUL
      suffixes << suffix
    end
  end
end

puts "Total suffixes: #{suffixes.size}"

# ngrams[n][g] counts how many suffixes start with the n-byte prefix g
ngrams = Hash.new{|h,k| h[k] = Hash.new{|h2,k2| h2[k2] = 0} }
all    = {}   # set of unique (truncated) suffixes
suffixes.each do |suffix|
  all[suffix] = true
  NGRAM_SIZES.each{|n| ngrams[n][suffix[0,n]] += 1 }
end

puts <<EOF

===============================
Character-based n-gram analysis
===============================
#{all.size} unique #{MAX_SUFFIX_SIZE}-grams
A lookup would take on average #{Math.log(suffixes.size)/Math.log(2)} disk seeks.

EOF

With my index, I get:

Total suffixes: 224687
196625 unique 64-grams
A lookup would take on average 17.7775571295372 disk seeks.

n-gram statistics

At this point, ngrams[n] is a hash mapping each n-gram to its number of occurrences, and we can try to obtain some interesting statistics:

  • number of suffixes which start with the same n bytes (the same n-gram)
  • mean, median, maximum number of suffixes per n-gram, as well as stddev
  • how many disk seeks we can expect to save on average

The latter is actually the entropy of the n-gram distribution. If there's only one n-gram, knowing the first n bytes doesn't give you any usable information (the entropy is 0), and you don't save any disk seeks; all you've done is increase the size of the SA with no gain whatsoever. If there are two n-grams, with 50% of the suffixes starting with one and 50% with the other, a binary search will take 1 disk seek less (you just saved 10ms). With n-byte prefixes we can save at most 8 * n seeks, and only if there are 2^(8 * n) suffixes, each starting with a different n-gram.

The gain (number of disk seeks saved) can thus be computed as

  gain = - sum_{i=1}^{M} (n_i / S) * log2(n_i / S)

where M is the number of distinct n-grams, n_i is the number of suffixes starting with the i-th n-gram, and S = sum_{i=1}^{M} n_i is the total number of suffixes.
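As a sanity check, here's the formula on a couple of toy distributions (gain is a throwaway helper, not part of the indexer):

```ruby
# gain(counts) = - sum_i (n_i/S) * log2(n_i/S): the expected number of
# disk seeks saved by knowing the n-gram, given per-n-gram suffix counts.
def gain(counts)
  s = counts.inject { |a, b| a + b }.to_f
  counts.inject(0) { |acc, n| p = n / s; acc - p * Math.log2(p) }
end

puts gain([50, 50])   # => 1.0  (two equally likely n-grams: one seek saved)
puts gain([100])      # => 0.0  (a single n-gram tells you nothing)
```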

def output_ngram_stats(ngrams, n)
  h = ngrams[n]
  nsuffixes = h.values.inject{|s,x| s+x}
  sq_avg = 1.0 * h.values.inject(0){|s,x| s + x ** 2} / h.size  # mean of squares
  avg = 1.0 * nsuffixes / h.size
  # Shannon entropy of the n-gram distribution, in bits: the expected gain
  entropy =  h.inject(0){|s,(k,c)| p = 1.0 * c / nsuffixes; s - p * Math.log(p)} / Math.log(2)
  puts <<E
Number of #{n}-grams: #{h.size}
Mean:    #{avg}
Median:  #{h.values.sort[h.size / 2]}
Maximum: #{h.values.sort.last}
Stddev.: #{Math.sqrt(sq_avg - avg ** 2)}
Gain   : #{entropy} bits

E
end


NGRAM_SIZES.each{|n| output_ngram_stats(ngrams, n)}

I get this:

Number of 4-grams: 14616                Number of 10-grams: 104822
Mean:    15.3726737821565               Mean:    2.14350995020129
Median:  2                              Median:  1
Maximum: 11857                          Maximum: 469
Stddev.: 121.607374446762               Stddev.: 5.35090264226657
Gain   : 10.6729310320304 bits          Gain   : 15.7399473128775 bits
                                        
Number of 8-grams: 73023                Number of 12-grams: 130326
Mean:    3.0769346644208                Mean:    1.72403818117643
Median:  1                              Median:  1
Maximum: 1257                           Maximum: 465
Stddev.: 11.1136402566513               Stddev.: 3.27607549288307
Gain   : 14.6709814769394 bits          Gain   : 16.3731975079201 bits

This means that storing the first 4 bytes of text would save nearly 11 disk seeks (over 100 ms)*1. That's much lower than the 32-seek maximum those 32 bits could encode in the ideal uniform case.

Storing 8 bytes in each "inline suffix" only saves another 4 seeks, so the second 4 bytes (4 seeks) aren't nearly as useful as the first 4 (nearly 11 seeks). This is expected: when the first few letters of two words coincide, the probability of the next few also matching is higher. Note how the gain drops to about 1 bit per character between 4-grams and 8-grams, which is close to the entropy of English text, estimated at between 0.6 and 1.3 bits per character.
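That per-character figure falls straight out of the gains reported above (the numbers are the ones from my run):

```ruby
# Incremental gain per extra inlined byte, going from 4-byte to 8-byte
# prefixes; the gain figures are those reported above for my index.
gain4 = 10.6729310320304
gain8 = 14.6709814769394
per_char = (gain8 - gain4) / 4.0
puts "%.2f bits per extra character" % per_char   # ~1.00
```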

Word-based n-grams

Using the first English stop-words list Google gave me to reject suffixes starting with a stop-word, I got the top 2-, 3- and 4-grams in Ruby's documentation; no big surprises:

WORD_NGRAM_SIZES = [2, 3, 4]
word_ngrams = Hash.new{|h,k| h[k] = Hash.new{|h2,k2| h2[k2] = 0} }
stopwords = {}
File.foreach("stop_words"){|l| stopwords[l.chomp] = true}
all.each_key do |suffix|
  words = suffix.split(/\s+/).map{|x| x.downcase}
  WORD_NGRAM_SIZES.each do |siz|
    next unless words.size >= siz
    w = words[0,siz]
    next if stopwords[w.first]
    word_ngrams[siz][w] += 1
  end
end

WORD_NGRAM_SIZES.each do |siz|
  puts "Top #{siz}-grams:"
  word_ngrams[siz].sort_by{|k,v| -v}[0...20].each do |words, count|
    puts "%6d  %s" % [count, words.join(" ")]
  end
  puts
end

Top 2-grams:

 499  returns the        170  end end            125  defaults to
281  returns a          158  used to            118  returns an
278  true if            141  creates a          118  return the
234  alias for          139  create a           114  method is
231  returns true       139  value of           108  block is
220  array of           137  list of            107  using the
197  number of          137  does not

Top 3-grams:

 203  returns true if             34  used as the
111  true if the                 33  wed apr 09
 94  returns a new               32  returns a string
 84  returns an array            32  based on the
 76  creates a new               30  added to the
 64  create a new                30  block is given,
 59  value of the                28  allows you to
 45  returns the number          27  select * from
 44  end of the                  26  h = {
 35  returns nil if              26  according to the

Top 4-grams:

  92  returns true if the          18  time.now #=> wed apr
 63  returns an array of          18  b' => 200 }
 44  returns the number of        17  100, 'b' => 200
 25  true if the named            17  returns true if stat
 22  available on all platforms.  17  returns a new array
 22  a' => 100, 'b'               15  returns a new time
 21  boolean | true if            15  t = time.now #=>
 20  h = { 'a'                    15  specified as a hash.
 20  a', 'b', 'c' ]               14  s = stringscanner.new('test string')
 20  returns the value of         14  block once for each

Top (char-based) n-grams

Here are the top n-grams for n = 4, 8, 12:

puts "=" * 80
puts "Top n-grams:"
NGRAM_SIZES.each{|n| p ngrams[n].sort_by{|k,c| -c}.map{|k,c| [k[/[^\0]*/], c]}[0..50] }

n = 4:

11857 the 	 1255 The 	  910 not 	  692 new 	  544 used
 2746 and 	 1187 valu	  872 in t	  625 data	  539 is t
 2433 for 	 1170 will	  863 cont	  618 give	  536 whic
 1756 of t	 1159 retu	  857 from	  615 test	  529 as a
 1528 meth	 1103 are 	  851 call	  604 all 	  528 arra
 1498 obje	 1091 clas	  843 end	  600 you 	  517 fals
 1447 name	 1040 to t	  776 is a	  578 inst	  499 inte
 1409 Retu	 1018 file	  747 spec	  574 if t
 1408 this	 1000 stri	  732 can 	  558 to a
 1383 with	  950 true	  721 opti	  547 defa
 1382 that	  927 This	  696 bloc	  545 File

n = 8:

1257 Returns 	  320 methods 	  275 associat	  231 This is 	  189 the valu
 546 will be 	  310 with the	  268 the curr	  230 exceptio	  188 of this 
 468 returns 	  305 from the	  266 characte	  223 and the 	  187 element 
 452 argument	  301 true if 	  260 an array	  218 Errno::E	  182 connecti
 419 specifie	  301 paramete	  257 objects 	  217 require 	  181 informat
 410 attribut	  300 the give	  253 returned	  210 This met	  180 elements
 388 instance	  292 transact	  248 director	  202 a string	  177 availabl
 352 Alias fo	  290 document	  242 number o	  200 the bloc	  177 variable
 329 for the 	  285 the name	  242 the same	  197 options 	  177 function
 326 default 	  284 current 	  236 represen	  195 array of	  172 of the c

n = 12:

 465  Returns the 	   99 the name of 	   69 value of the
 252 the current 	   98 the document	   65 an exception
 249 Returns true	   95 correspondin	   65 converted to
 210 This method 	   89 the associat	   64 Transaction:
 156 an array of 	   88 Returns a ne	   64 containing t
 136 true if the 	   88 documentatio	   64 false
 126 ActiveRecord	   84 automaticall	   61 Returns an a
 126 the number o	   83 the specifie	   61 is specified
 119 transaction 	   80 the value of	   61 defaults to 
 116 information 	   80 end		   61 true
 111 returns the 	   79 the attribut	   61 argument is 
 110 name of the 	   77 representing	   60 example.com'
 110 the default 	   76 Defaults to 	   58 the named fi
 107 the followin	   73 Creates a ne	   58 Julian Day N
 106 this method 	   72 transaction_	   57 the end of t
 105 produces:		   70 object.		   56 000\000\000\
 100 and returns 	   70 Equivalent t	   55 a', 'b', 'c'


Last modified:2006/12/24 04:50:54
Keyword(s):[blog] [ruby] [frontpage] [suffix] [array] [n-gram] [analysis] [ftsearch]
References:

*1 since that corpus is so small, most of the full text file will be cached after the first few seeks, so a lookup will normally take much fewer than 17 actual disk seeks. However, the decrease in the number of disk seeks will hold for much larger corpora if the n-gram distribution is similar.