
Persistent URLs: really easy (thank you open-uri, SOAP4R, Ruby)

Using google (or any other search engine) to generate persistent URLs is one of those obvious ideas that make you wonder whether you actually came up with them on your own or had simply been exposed to them before. At any rate, I had never seen an implementation*1, so here's mine.

But first of all, some examples of the persistent URLs created by the script shown below:

It doesn't always work that well; for instance, the persistent URL of my Ruby 1.9 change summary (the first hit for http://google.com/search?q=ruby+1.9 ) becomes http://google.com/search?q=ruby+foo+file+method+nil+array+index+proc+def+methods .

Implementation

This is pretty easy; all one needs to do is (see the sketch after this list):

  • extract candidate search terms from the desired destination URL:
    • only consider text
    • try to find significant terms
  • check against google, verifying if the chosen query is good enough
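
In outline, using the pieces defined in the rest of this post (a rough sketch; html stands for the already fetched page and url for the target URL):

# Sketch of the overall flow; each_word, common_words and google_for are defined below
counts = Hash.new(0)
each_word(html){|word| counts[word] += 1 }           # candidate terms
terms = counts.keys.sort_by{|w| -counts[w]} - common_words
1.upto(10) do |n|                                    # grow the query term by term
  idx, = google_for(url, terms[0...n])               # where does url rank?
  break if idx == 0                                  # first hit: good enough
end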

Extracting text from arbitrary HTML pages

There's no need for a full parse tree of the HTML: just the list of words that would be considered by google will do.

I took some old code of mine, from one of my very first (useful) Ruby scripts (a filtering proxy that added hints to German pages, inspired by jisyo.org, which does the same for Japanese text). It just uses a number of regexps to reject unwanted parts of the text, until we're left with simple words. It's not too inefficient thanks to strscan, and as naïve as the regexps might seem, they work well in practice:

require 'strscan'

REGEXPS = [
  # each pair: the first regexp rejects text, the second accepts it; if the second element is an array, descend into it
  /\s*<\s*(?:script|option|style)\s*[^>]*?>(.*?)<\/(?:script|option|style)>\s*/im, 
  [/\s*<!--.*?-->\s*/m,
    [/\s*<[^>]*?>\s*/m,
# ignore the iso-8859-1 character ranges; they made sense in the original code
      [/[^a-zA-ZáéíóúñÑäëïüößçÁÉÍÓÚàèìòùÀÈÌÒÙÄËÏÖÜ&;<>]+/im, 
       /([a-zA-ZáéíóúñÑäëïüößçÁÉÍÓÚàèìòùÀÈÌÒÙÄËÏÖÜ]|\&([^;<]+?);)+/i
      ]
    ]
  ]
]
FALLBACK_REGEXPS = [/[^>]?\s+/, /[^<>]/]

def each_word(text)
  scanner = StringScanner.new(text)
  while scanner.rest?
    regexp_pair = REGEXPS
    catch(:done) do 
      until Regexp === regexp_pair.last 
        reject, accept = regexp_pair
        if scanner.scan(reject)
          # reject text
          throw :done
        else
          regexp_pair = accept
        end
      end
      # check the last RE; if it doesn't match, we reject this
      if scanner.scan(regexp_pair.last)
        # TODO: decode entities
        yield scanner.matched.downcase unless /\&([^;<]+?);/ =~ scanner.matched
      elsif FALLBACK_REGEXPS.any?{|re| scanner.scan(re)}
        # ignore
        throw :done
      else
        # bad bad bad
        return
      end
    end # catch :done
  end
end
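
As a quick sanity check, feeding each_word a small, made-up HTML fragment yields only the plain words, with tags, scripts and entities dropped:

# Illustrative input, not from the original script
html = "<p>Ruby &amp; <b>persistent</b> URLs</p><script>ignored()</script>"
words = []
each_word(html){|w| words << w}
p words   # => ["ruby", "persistent", "urls"]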

Selecting significant terms

You can do some fancy stuff in this phase, for instance using latent semantic analysis (LSA), but I went for an utterly simplistic approach: just rank words by frequency, discarding the most common words in English (taken from an online list):

common_words = DATA.readlines.map{|l| l.chomp.downcase}
sorted_words = nil

require 'open-uri'

if ARGV.size != 1
  puts "ruby google_urn.rb <URL>"
  exit
end

open(ARGV[0]) do |file|
  w_hash = Hash.new{|h,k| h[k] = 0}
  each_word(file.read){|word| w_hash[word] += 1 }
  sorted_words = w_hash.keys.sort_by{|x| w_hash[x]}.reverse
end

puts "Top 20 words: #{sorted_words[0...20].join(" ")}"
search_terms = (sorted_words - common_words)
puts "Search terms: #{search_terms[0..10].join(" ")}"

Evaluating search candidates

Once again, this is really easy, thanks to SOAP4R. To tell the truth, I cut & pasted that snippet from a ruby-talk posting of mine, but the original code was mostly stolen from the nadoka IRC proxy/bouncer, written by ko1. [By the way, you can help him finish YARV by taking over nadoka's maintenance; he was looking for a new maintainer a few months ago, though I don't know if that still holds.]

The following code just keeps adding terms to the query until the desired URL becomes the first hit:

require 'soap/wsdlDriver'
def google_for(url, terms)
  puts "Googling for #{url} with #{terms.inspect}."
  g = SOAP::WSDLDriverFactory.new('http://api.google.com/GoogleSearch.wsdl').create_rpc_driver
  g.generate_explicit_type = true
  google_key = File.read(File.join(ENV["HOME"], ".google_key"))

  r = g.doGoogleSearch(google_key, terms.join(" "), 0, 10, false, "", false, "", "", "" )

  r.resultElements.each_with_index do |e, i|
    if [url, url + "/"].include? e['URL']
      return [i, r.estimatedTotalResultsCount] 
    end
  end

  puts "1st hit: #{r.resultElements[0]['URL']}"
  nil
# the request fails in weird ways; it usually works the second time...
rescue SOAP::HTTPStreamError, WSDL::XMLSchema::Parser::UnknownElementError
  retry
end 

1.upto(10) do |nterms|
  terms = search_terms[0...nterms]
  idx, total = google_for(ARGV[0], terms)
  if idx == 0
    puts "GOOGLE URN: http://google.com/search?q=#{terms.join("+")}"
    break
  end
end
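
For reference, google_for as defined above returns [index of the URL among the first ten hits, estimated total results], or nil when the URL doesn't show up; a call might look like this (URL and terms are made up for illustration, and a Google SOAP API key is assumed to live in ~/.google_key):

idx, total = google_for("http://example.com/some/page", %w[example terms])
puts "position #{idx + 1} of ~#{total} results" if idx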

Some examples

Of course, this makes little sense with pages that change often, such as blogs:

but it works great for archival areas:

Full code

"Just run it": google_url.rb


Direct Persistent URLs - Alan (2006-06-16 (Fri) 07:38:31)

You can use Google's "I'm Feeling Lucky" button to go directly to your page.

Here is the modified first example: http://www.google.com/search?q=ruby+rb+challenge&btnI=I%27m+Feeling+Lucky

which simply appends the following to any google query: &btnI=I%27m+Feeling+Lucky
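
Applied to the script above, that would amount to a one-line tweak of the output line (not part of the original code):

puts "GOOGLE URN: http://google.com/search?q=#{terms.join("+")}&btnI=I%27m+Feeling+Lucky"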



*1 I didn't bother to look into http://purl.org/ 's more general approach...