
Related document discovery, without algebra

You have heard about latent semantic analysis (if you haven't, take a look at this nice article on an SVD recommendation system written in 50 lines of Ruby). And you told yourself "hey, this is cool", only to file it away in your head. Or maybe you tried to actually use it, but were scared off by the algebra 101 part, or got lazy when you realized you needed to compile LAPACK, GSL or some other numerical library*1.

But you can get pretty far even without dimensionality reduction. If the feature space (e.g. the terms/concepts associated with your documents) is small enough, and you make sure synonymy is not a problem, you can do without algebra. One such case is that of your blog postings and their tags.

LSI is about reducing the dimensionality of a sparse term-document matrix, mitigating synonymy (different terms referring to the same idea) and polysemy (a word having multiple meanings). A program would do it using singular value decomposition, but you're also performing dimensionality reduction each time you tag your articles, mapping the text to a small number of keywords.
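To make the analogy concrete, here's a toy illustration (made-up terms and document, nothing from the real index): the raw term-document representation has one dimension per term, while the tags are a hand-made low-dimensional stand-in for the latent concepts SVD would find.

terms = %w[ruby coverage rcov test spec code]   # thousands of dimensions in practice
doc   = "rcov measures code coverage for ruby"
raw_vector = terms.map{|t| doc.split.count(t)}  # => [1, 1, 1, 0, 0, 1]
tags       = %w[ruby rcov coverage]             # the hand-made, reduced representation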

This means that you can use your tags to compute the cosine similarity between your posts, and find related pages in a dozen lines of code. The code that creates the "see also" box in eigenclass.org's pages looks essentially like this:

documents = {
  "ruby 1.8.5 released"  => %w[ruby 1.8.5 release changelog],
  "changes in ruby 1.9"  => %w[ruby 1.9 changelog],
  "rcodetools"           => %w[ruby rcodetools xmpfilter code completion annotation tags ri],
  "fastri"               => %w[ruby ri],
  "rcov"                 => %w[ruby code coverage rcov],
  "vim+ruby screencasts" => %w[ruby vim screencast rcov xmpfilter simplefold],
  # ...
}

def related_docs(feature_vectors, name)
  pair = feature_vectors.assoc(name)
  return nil unless pair
  feature_vectors.sort_by do |n, v|
    corr = 0 # inject is slower and often less clear
    pair.last.each_with_index{|val, i| corr += val * v[i]}
    -corr # negate so the highest correlation sorts first
  end.reject{|n, v| name == n} # drop the document we started from
end

all_tags = documents.inject([]){|s,(k,v)| s.concat(v) }.uniq.sort # full tag vocabulary, in a fixed order
feature_vectors = documents.map do |name, tags| 
  # the vectors should be normalized so that the cosine distance is just the
  # dot product, but this seems to work well in practice even without
  # normalization (a normalized variant is sketched after the output below)
  [name, all_tags.map{|x| tags.include?(x) ? 1 : 0}]
end

p related_docs(feature_vectors, "rcov")
# >>
# [["vim+ruby screencasts", [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]], 
#  ["rcodetools", [0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1]], 
#  ["ruby 1.8.5 released", [1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]], 
#  ["fastri", [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]], 
#  ["changes in ruby 1.9", [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]]]

Can it be done better? Sure: if I ever have so many nodes that synonymy becomes a problem (I forget old tags and start to use new ones), it will perform poorly. But that's still very good mileage for a dozen lines of code, and it's better than nothing. When was the last time you searched the chronological archives of some blog?
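(A cheap patch for the synonymy problem, sketched with made-up aliases: run every tag through a hand-maintained synonym table before building the vectors.)

TAG_ALIASES = {"unit-testing" => "test", "rdoc" => "ri"} # hypothetical aliases
def canonicalize(tags)
  tags.map{|t| TAG_ALIASES[t] || t}.uniq
end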



classifier - Cameron (2007-01-20 (Sat) 13:49:53)

Hmmm, you're right. The "latest" versions available aren't working well. Classifier's LSI had some nice dev ideas and some tweaks, but we never finalized our efforts or consolidated them into a consistent tree. Sad, really. We all seemed to move off in different directions before cleaning house.

Maybe I'll have a look and ping Dave and Lucas to resolve this. Classifier is fun to play with. And the summarizer with LSI was pretty interesting.

Thanks for the reminder, and the interesting posts!

mfp 2007-01-20 (Sat) 17:44:40

I'm looking forward to seeing classifier work again. Its LSI is the real thing, and I'll be happy to use it instead of the approximation with tags :) (I think many people will appreciate classifier's ability to operate in pure-Ruby mode, without rb-gsl.)

Cameron 2007-01-20 (Sat) 21:24:44

Well, LSI in pure Ruby works, but it's really more of a toy. A full SVD on anything large is computationally intensive. Even using something like GSL or an NArray extension (or a LAPACK interface, etc.), its scaling is a bit ..um.. difficult.

Dave (Fayram) spent some time trying to address this by essentially having a cluster prebuild a persistent index (aka the SVD part), and streamlining the search (which doesn't require the SVD). I believe this work was for something much larger than typical use - but the limitations of LSI are ever-present.

This is the reason I think approximations (like what your post highlights) and Rudi's work are both exciting and useful.

But that aside, we really should get classifier patched up.


another alternative to algebra - Rudi Cilibrasi (2007-01-20 (Sat) 20:08:35)

Another alternative to algebra/LSA/LSI/SVD is to use NCD; cf. http://complearn.org/

My research is all about using data compression programs to replace the hard math in statistics. Sometimes, in certain domains (usually document clustering), it works well.
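A minimal sketch of the idea, using Ruby's zlib as a toy compressor (CompLearn uses stronger ones; this is just to show the formula):

require 'zlib'

# NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
# where C(s) is the compressed size of s
def ncd(x, y)
  cx  = Zlib::Deflate.deflate(x).size
  cy  = Zlib::Deflate.deflate(y).size
  cxy = Zlib::Deflate.deflate(x + y).size
  (cxy - [cx, cy].min) / [cx, cy].max.to_f
end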

Cameron 2007-01-20 (Sat) 21:08:48

Rudi,

NCD looks very interesting. I was actually considering seeing how it worked and whether it could be fit into classifier itself (a while ago, when I looked over your site and read a bit of your thesis).

In any case, the approach of using data compression is brilliant. The "hard math" is really a large hurdle, and utilizing things like the Google compression distance is a very cool idea.


Use SQL? - Ilya Grigorik (2007-01-24 (Wed) 14:38:54)

Very cool, I love the ruby-fu. However, since you're not doing dimensionality reduction, you could probably offload the entire vector-space model to SQL! That way you get all the benefits through a single query.

http://www.howtodothings.com/computers/a1259-word-searching-in-mysql.html
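For instance, with a hypothetical taggings(post_id, tag) table, the shared-tag counts (the same unnormalized dot products as the Ruby version above) come out of a single self-join; a sketch, not tied to any particular engine:

sql = <<-SQL
  SELECT b.post_id, COUNT(*) AS shared_tags
  FROM taggings a
  JOIN taggings b ON a.tag = b.tag AND b.post_id <> a.post_id
  WHERE a.post_id = ?
  GROUP BY b.post_id
  ORDER BY shared_tags DESC
SQL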




*1 also, classifier was broken the last time I tried it...