Persistent URLs: really easy (thank you open-uri, SOAP4R, Ruby)
Using Google (or any other search engine) to generate persistent URLs is one of those obvious ideas that make you wonder whether you came up with them on your own or had been exposed to them before. At any rate, I had never seen an implementation*1, so here's mine.
But first of all, some examples of the persistent URLs created by the script shown below:
- my Ruby puzzle/challenge becomes http://google.com/search?q=ruby+rb+challenge
- the cheap rsync --append substitute in Ruby: http://google.com/search?q=ruby+file+rsync
- an explanation of sort_by{rand}'s bias: http://google.com/search?q=ruby+bigdecimal+n+rand
It doesn't always work that well; for instance, the persistent URL of my Ruby 1.9 change summary (the first hit for http://google.com/search?q=ruby+1.9 ) becomes http://google.com/search?q=ruby+foo+file+method+nil+array+index+proc+def+methods .
Implementation
This is pretty easy; all one needs to do is:
- extract candidate search terms from the desired destination URL:
  - only consider text
  - try to find significant terms
- check against Google, verifying that the chosen query is good enough
Extracting text from arbitrary HTML pages
There's no need for a full parse tree of the HTML: just the list of words that would be considered by Google will do.
I took some old code of mine, from one of my very first (useful) Ruby scripts (a filtering proxy that added hints to German pages, inspired by jisyo.org, which does the same for Japanese text). It just uses a number of regexps to reject unwanted parts of the text, until we're left with simple words. It's not too inefficient thanks to strscan, and as naïve as the regexps might seem, they work well in practice:
    require 'strscan'

    REGEXPS = [
      # first rejected, second accepted; if array, repeat
      /\s*<\s*(?:script|option|style)\s*[^>]*?>(.*?)<\/(?:script|option|style)>\s*/im,
      [/\s*<!--.*?-->\s*/m,
        [/\s*<[^>]*?>\s*/m,
          # ignore the iso-8859-1 stuff, that made sense in the orig. code though
          [/[^a-zA-ZáéíóúñÑäëïüößçÁÉÍÓÚàèìòùÀÈÌÒÙÄËÏÖÜ&;<>]+/im,
            /([a-zA-ZáéíóúñÑäëïüößçÁÉÍÓÚàèìòùÀÈÌÒÙÄËÏÖÜ]|\&([^;<]+?);)+/i
          ]
        ]
      ]
    ]

    FALLBACK_REGEXPS = [/[^>]?\s+/, /[^<>]/]

    def each_word(text)
      scanner = StringScanner.new(text)
      while scanner.rest?
        regexp_pair = REGEXPS
        catch(:done) do
          until Regexp === regexp_pair.last
            reject, accept = regexp_pair
            if scanner.scan(reject)
              # reject text
              throw :done
            else
              regexp_pair = accept
            end
          end
          # check the last RE; if it doesn't match, we reject this
          if scanner.scan(regexp_pair.last)
            # TODO: decode entities
            yield scanner.matched.downcase unless /\&([^;<]+?);/ =~ scanner.matched
          elsif FALLBACK_REGEXPS.any?{|re| scanner.scan(re)}
            # ignore
            throw :done
          else
            # bad bad bad
            return
          end
        end # catch :done
      end
    end
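For comparison, here is a much cruder sketch of the same idea, assuming you only care about ASCII words and can afford a few passes over the whole string. It is not the scanner above: `extract_words` is a made-up helper that strips markup with `gsub!` and then collects words with `scan`, and it simply drops entities rather than decoding them.

```ruby
# Simplified word extraction: strip markup with gsub!, then scan for words.
# extract_words is a hypothetical helper, not part of the original script.
def extract_words(html)
  text = html.dup
  # Drop whole script/style/option elements, contents included.
  text.gsub!(/<\s*(script|style|option)\b[^>]*>.*?<\/\s*\1\s*>/im, " ")
  # Drop comments, then any remaining tags.
  text.gsub!(/<!--.*?-->/m, " ")
  text.gsub!(/<[^>]*>/m, " ")
  # Drop character entities such as &amp; instead of decoding them.
  text.gsub!(/&[^;<\s]+;/, " ")
  # Keep runs of ASCII letters only, lowercased.
  text.scan(/[a-zA-Z]+/).map { |w| w.downcase }
end

html = <<~HTML
  <html><head><style>p { color: red }</style>
  <script>var hidden = 1;</script></head>
  <body><!-- a comment --><p>Ruby &amp; <b>Persistent</b> URLs</p></body></html>
HTML

p extract_words(html)  # => ["ruby", "persistent", "urls"]
```

The multi-pass version is easier to read but copies and rewrites the text several times, which is the kind of cost the single-pass strscan approach avoids.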
Selecting significant terms
You can do some fancy stuff in this phase, using for instance latent semantic analysis (LSA), but I went for an utterly simplistic approach: just choose words based on their frequencies, discarding the most common words in English (taken from an online list):
    common_words = DATA.readlines.map{|l| l.chomp.downcase}
    sorted_words = nil

    require 'open-uri'

    if ARGV.size != 1
      puts "ruby google_urn.rb <URL>"
      exit
    end

    open(ARGV[0]) do |file|
      w_hash = Hash.new{|h,k| h[k] = 0}
      each_word(file.read){|word| w_hash[word] += 1 }
      sorted_words = w_hash.keys.sort_by{|x| w_hash[x]}.reverse
    end

    puts "Top 20 words: #{sorted_words[0...20].join(" ")}"
    search_terms = (sorted_words - common_words)
    puts "Search terms: #{search_terms[0..10].join(" ")}"
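The selection step can also be tried in isolation, without fetching a page. The sketch below assumes the words are already extracted; `significant_terms` and the tiny `STOP_WORDS` list are hypothetical stand-ins (the script above reads its common-word list from DATA), and the alphabetical tie-break is an addition to keep the output deterministic.

```ruby
# Rank words by frequency and drop common English words.
# significant_terms and STOP_WORDS are hypothetical names; the original
# script reads its stop-word list from DATA instead.
STOP_WORDS = %w[the a an and or of to in is it that for with as].freeze

def significant_terms(words, stop_words = STOP_WORDS, limit = 10)
  counts = Hash.new(0)
  words.each { |w| counts[w.downcase] += 1 }
  # Sort by descending count; break ties alphabetically so the result is stable.
  ranked = counts.keys.sort_by { |w| [-counts[w], w] }
  (ranked - stop_words).first(limit)
end

words = %w[the ruby file the rsync ruby the file ruby of append]
p significant_terms(words)  # => ["ruby", "file", "append", "rsync"]
```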
Evaluating search candidates
Once again, this is really easy, thanks to SOAP4R. To tell the truth, I cut and pasted that snippet from a ruby-talk posting of mine, but the original code was mostly stolen from the nadoka IRC proxy/bouncer, which was written by ko1. [By the way, you can help him finish YARV by taking over nadoka's maintenance; he was looking for a new maintainer a few months ago, though I don't know if that still holds.]
The following code just keeps adding terms to the query until the desired URL is the first hit:
    require 'soap/wsdlDriver'

    def google_for(url, terms)
      puts "Googling for #{url} with #{terms.inspect}."
      g = SOAP::WSDLDriverFactory.new('http://api.google.com/GoogleSearch.wsdl').create_rpc_driver
      g.generate_explicit_type = true
      google_key = File.read(File.join(ENV["HOME"], ".google_key"))
      r = g.doGoogleSearch(google_key, terms.join(" "), 0, 10, false, "", false, "", "", "" )
      r.resultElements.each_with_index do |e, i|
        if [url, url + "/"].include? e['URL']
          return [i, r.estimatedTotalResultsCount]
        end
      end
      puts "1st hit: #{r.resultElements[0]['URL']}"
      nil
    # the request fails in weird ways; it usually works the second time...
    rescue SOAP::HTTPStreamError, WSDL::XMLSchema::Parser::UnknownElementError
      retry
    end

    1.upto(10) do |nterms|
      terms = search_terms[0...nterms]
      idx, total = google_for(ARGV[0], terms)
      if idx == 0
        puts "GOOGLE URN: http://google.com/search?q=#{terms.join("+")}"
        break
      end
    end
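The term-growing loop does not depend on any particular search API, so it can be sketched with the search call passed in as a block. Everything named here is an assumption for illustration: `shortest_query` is a hypothetical helper, the block must return result URLs in rank order, and `fake_search` is a canned stand-in for a real backend.

```ruby
# Grow the query one term at a time until the target URL ranks first.
# shortest_query is a hypothetical helper; the block stands in for a real
# search call and must return the result URLs in rank order.
def shortest_query(url, terms, max_terms = 10)
  1.upto([max_terms, terms.size].min) do |n|
    query = terms.first(n)
    results = yield(query)
    # Accept the URL with or without a trailing slash, as the original does.
    return query if [url, url + "/"].include?(results.first)
  end
  nil
end

# Fake search backend for demonstration: the target only ranks first
# once the query contains both "ruby" and "rsync".
fake_search = lambda do |query|
  if (%w[ruby rsync] - query).empty?
    ["http://example.org/rsync", "http://example.org/other"]
  else
    ["http://example.org/other", "http://example.org/rsync"]
  end
end

terms = %w[ruby file rsync append]
query = shortest_query("http://example.org/rsync", terms) { |q| fake_search.call(q) }
puts "http://google.com/search?q=#{query.join('+')}" if query
# prints http://google.com/search?q=ruby+file+rsync
```

Returning nil when no prefix of the term list works makes the failure case explicit, which the top-level loop above leaves silent.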
Some examples
- http://en.wikipedia.org/wiki/URN becomes http://google.com/search?q=urn+resource+uniform+wikipedia
- http://eigenclass.org/hiki.rb?RAA+vs+CPAN+cost is turned into http://google.com/search?q=ruby+locs+files+cpan
Of course, this makes little sense with pages that change often, such as blogs:
but it works great for archival areas:
- http://redhanded.hobix.com/inspect/duckTrustMetrics1Pagerank.html -> http://google.com/search?q=site+pagerank+utc+sites+eig
Full code
"Just run it": google_url.rb
Direct Persistent URL's - Alan (2006-06-16 (Fri) 07:38:31)
You can use Google's "I'm Feeling Lucky" button to go directly to your page.
Here is the modified first example: http://www.google.com/search?q=ruby+rb+challenge&btnI=I%27m+Feeling+Lucky
which simply appends the following to any Google query: &btnI=I%27m+Feeling+Lucky
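As a small sketch of that suggestion, the query string can be built with the standard `uri` library so the escaping comes out the same as in the example. `google_url` is a hypothetical helper name, and whether the `btnI` parameter is honored is up to the search engine.

```ruby
require 'uri'

# Build a Google search URL, optionally with the btnI parameter from the
# comment above. google_url is a hypothetical helper name.
def google_url(terms, lucky = false)
  params = [["q", terms.join(" ")]]
  params << ["btnI", "I'm Feeling Lucky"] if lucky
  "http://www.google.com/search?" + URI.encode_www_form(params)
end

puts google_url(%w[ruby rb challenge], true)
# prints http://www.google.com/search?q=ruby+rb+challenge&btnI=I%27m+Feeling+Lucky
```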
*1 I didn't care to read http://purl.org/ 's more general one...