Purging referrer URLs concurrently
Having written a "pooling executor" that assigns tasks to a number of handlers in parallel, I found it much easier to filter referrer URLs reasonably efficiently. Checking multiple referrers concurrently helps maximize bandwidth usage, which matters when you have to fetch ~8000 pages. Done serially, the process could easily take a couple of hours or more.
The script described below helped me remove about 93% of the referrer URLs, going from 13595 (7909 unique) to 933 (653 unique).
Task description
My HTTP referrers (and the corresponding hits) are stored as serialized hashes in a number of files, marshalled with TMarshal, AMarshal's elder (yet simpler) sibling:
{
"http://1470.net/mm/mylist.html/272?date=2006-3-1" => 1,
"http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/181885" => 21,
"http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/181963" => 2,
"http://datavet.htu.se/kurser/KVC010/index.html" => 5,
"http://dev.rubyonrails.org/ticket/4269" => 1,
"http://fi.wikipedia.org/wiki/Uutisryhm%C3%A4" => 2,
"http://ja.reddit.com/goto?id=2i9j" => 5,
"http://ja.reddit.com/user/matz/" => 1,
"http://lists.debian.org/debian-devel/2006/03/msg00875.html" => 1,
"http://moonrock.jp:23000/articles/2006/03/03/rcov" => 1,
#....
}
I just have to iterate over all these files, verify that each referrer location actually exists and seems OK (that is, the page really links back to my site), and overwrite the data with the entries that survive.
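As a rough illustration of that round trip, the sketch below loads such a file with eval (the same approach purge_bad_referers takes further down) and writes a filtered hash back in the same layout. The helper names load_referers and dump_referers, the file name referers.sample and the filtering rule are all made up for the example, and eval is only reasonable for files you generated yourself.

```ruby
require 'tmpdir'

# Load a referrer file: its content is a plain Ruby Hash literal.
# eval is only acceptable because the file is trusted, self-generated data.
# (load_referers / dump_referers are hypothetical helper names.)
def load_referers(filename)
  eval(File.read(filename))
end

# Write the hash back as a Hash literal, one sorted entry per line.
def dump_referers(filename, hash)
  File.open(filename, "w") do |f|
    f.puts "{"
    hash.keys.sort.each do |key|
      f.puts "#{key.inspect} => #{hash[key].inspect},"
    end
    f.print "}"
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "referers.sample")   # hypothetical file name
  File.write(path, <<~DATA)
    {
    "http://dev.rubyonrails.org/ticket/4269" => 1,
    "http://ja.reddit.com/goto?id=2i9j" => 5,
    }
  DATA

  referers = load_referers(path)
  # Stand-in filter; the real check fetches each page instead.
  kept = referers.select { |url, _count| url.include?("reddit") }
  dump_referers(path, kept)
  puts File.read(path)
end
```

The trailing comma after the last entry is valid in a Ruby Hash literal, which is why the writer does not need to special-case the final line.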
Checking a referrer
Some URLs can be discarded right away: search engines, common RSS aggregators (mostly Bloglines, which I get lots of hits from), Google Groups... I also whitelisted a few patterns to make the script a bit faster, and I cache the results to avoid downloading a page twice.
GOOD_REFERER_RE = %r{^http://(www\.)?eigenclass\.org/hiki\.rb\?[^?;]+$|
                     http://(www\.)?anarchaia\.org}x
BAD_REFERER_RE = %r{a9\.com/|google.*/search|mail\.google\.|
                    bloglines\.com|eigenclass\.org|
                    search\?q=cache|search\.msn\.|blogsearch\.google\.|
                    del\.icio\.us|search\?q=|
                    comp\.lang\.ruby
                    }x
LINK_TEXT = %r{eigenclass\.org}

def check_referer(referer, count, ok_referer_hash, referer_info_cache)
  record_good = lambda do
    ok_referer_hash[referer] = count
    referer_info_cache[referer] = true
  end
  record_bad = lambda{ referer_info_cache[referer] = false }

  record_bad.call unless %r{http://}i =~ referer
  begin
    if referer_info_cache.has_key?(referer)
      puts "CACHE HIT: #{referer}"
      record_good.call if referer_info_cache[referer]
    elsif GOOD_REFERER_RE =~ referer && BAD_REFERER_RE !~ referer
      record_good.call
    elsif BAD_REFERER_RE =~ referer
      # ignore them
      puts "DISCARDING: #{referer}"
      record_bad.call
    else
      times = 0
      begin
        Timeout.timeout(10) do
          open(referer) do |is|
            # could use a more sophisticated test (e.g. content classification)
            if is.read(100000)[LINK_TEXT]
              record_good.call
            else
              record_bad.call
            end
          end
        end
      rescue Timeout::Error
        times += 1
        if times < 3
          retry
        else
          raise
        end
      end
    end
  rescue Timeout::Error
    puts "GET #{referer} timed out."
    record_bad.call
  end
end
I'm using Timeout with a 10-second limit and trying each URL up to 3 times for the lazy servers out there. The "check" (does the first 100000 bytes of the page mention eigenclass.org?) isn't the epitome of sophistication, but it's very effective.
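The timeout-and-retry pattern can also be isolated into a small helper. This is only a sketch: with_retries is a hypothetical name that does not appear in the script, and the block passed to it stands in for the open(referer) call. A sleep simulates a server that hangs on the first two attempts.

```ruby
require 'timeout'

# Run the block under a per-attempt timeout, making at most `attempts` tries.
# Re-raises Timeout::Error once the last attempt has timed out.
# (with_retries is a hypothetical helper, not part of the original script.)
def with_retries(attempts: 3, seconds: 10)
  tries = 0
  begin
    tries += 1
    Timeout.timeout(seconds) { yield tries }
  rescue Timeout::Error
    retry if tries < attempts
    raise
  end
end

# Simulated slow server: the first two attempts hang, the third answers.
result = with_retries(attempts: 3, seconds: 0.05) do |attempt|
  sleep 1 if attempt < 3
  "answered on attempt #{attempt}"
end
puts result
```

As in check_referer, the counter bounds the total number of attempts, and the final Timeout::Error is passed on to the caller, who decides what a permanent failure means.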
Concurrent verifiers
The key to making this run faster is checking several URLs at a time. My PoolingExecutor handles that easily, creating the threads and throttling HTTP requests for me:
require 'fileutils'
require 'poolingexecutor'
require 'open-uri'
require 'timeout'

def purge_bad_referers(filename, good_referrers = {})
  referer_hash = eval(File.read(filename))
  ok_referer_hash = {}
  executor = PoolingExecutor.new do |handlers|
    15.times do
      # we could as well register a plain Object and use it as a "token",
      # with the actual code in the executor.run { ... } block
      handlers << lambda do |referer, count|
        check_referer(referer, count, ok_referer_hash, good_referrers)
      end
    end
  end
  referer_hash.each do |referer, count|
    executor.run do |handler|
      puts "Processing #{referer}"
      puts "Keeping #{referer}" if handler[referer, count]
    end
  end
  executor.wait_for_all
  File.open(filename, "w") do |f|
    f.puts "{"
    ok_referer_hash.keys.sort.each do |key|
      f.puts "#{key.inspect} => #{ok_referer_hash[key].inspect},"
    end
    f.print "}"
  end
end
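For readers who don't have PoolingExecutor at hand, the general idea (a fixed number of workers draining a list of slow checks) can be approximated with core Thread and Queue. The sketch below is a stand-in under that assumption: it does not reproduce PoolingExecutor's API or internals, check_all is a hypothetical name, and the example URLs and the rule applied to them are invented.

```ruby
# Stand-in for the pooling idea using only core Thread, Queue and Mutex.
# This does NOT reproduce PoolingExecutor's API; check_all is hypothetical.
def check_all(items, workers: 15)
  jobs = Queue.new
  items.each { |item| jobs << item }
  results = {}
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      loop do
        item = begin
          jobs.pop(true)        # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break                 # queue drained: this worker is done
        end
        verdict = yield item    # the slow check (an HTTP fetch in the real script)
        lock.synchronize { results[item] = verdict }
      end
    end
  end
  threads.each(&:join)          # plays the role of executor.wait_for_all
  results
end

urls = %w[http://a.example/ http://b.example/ http://c.example/]
verdicts = check_all(urls, workers: 2) do |url|
  sleep 0.1                     # pretend to download the page
  !url.include?("b.example")    # made-up rule for the demo
end
verdicts.each { |url, ok| puts "#{ok ? 'Keeping' : 'Dropping'} #{url}" }
```

The number of workers caps how many requests are in flight at once, which is the throttling effect described above; the mutex around the shared hash is a precaution this sketch adds.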
All that's left is iterating over the files I have to process:
good_referrers = {}
ARGV.each do |fname|
  FileUtils.cp(fname, "#{fname}.bak")
  puts
  puts "=" * 80
  puts fname
  puts "=" * 80
  purge_bad_referers(fname, good_referrers)
end