
Hashes with default procs as memoized functions

While thinking about optimizing the small script I wrote long ago to analyze the geographical distribution of eigenclass' readership, I bumped into yet another instance of the "hash with default block as memoized function" idiom. I seem to find it quite often when reviewing older code of mine (or at least it feels that way).

That script uses GeoIP's command-line utility to determine which country an IP address is associated with. Calling an external program is really slow, so I absolutely want to cache those results:

# The block runs on a cache miss, shells out to geoiplookup once and memoizes the result.
iptocountry = Hash.new do |h,ip|
  h[ip] = `geoiplookup #{ip}`.chomp.gsub(/^GeoIP Country Edition: /,"")
end
# Preload whatever was cached by previous runs, if the cache file exists.
iptocountry.update Marshal.load(File.read("geo.cache")) rescue nil
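
Stripped of the GeoIP specifics, the idiom looks like this (a minimal sketch; the memoized Fibonacci is just a stand-in for any expensive computation):

# The block only runs on a cache miss and stores its result, so later
# lookups with the same key are plain hash accesses.
fib = Hash.new{|h,n| h[n] = n < 2 ? n : fib[n-1] + fib[n-2]}
fib[100]   # => 354224848179261915075, each fib[n] computed only once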

One minor inconvenience of hashes with default procs is that Marshal refuses to dump them (procs cannot be serialized), so it would seem one needs something like

File.open("geo.cache", "w"){|f| Marshal.dump(Hash[*iptocountry.to_a.flatten], f)}

We can do much better though:

iptocountry.default = nil
File.open("geo.cache", "w"){|f| Marshal.dump(iptocountry, f)}

Hash#default= will override the default specified in the Hash.new call.
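
A quick demonstration of that behaviour (throwaway keys, same mechanism):

h = Hash.new{|hash,k| hash[k] = k.to_s.upcase}
h["ar"]           # => "AR"  (the block fires and stores the value)
h.default = nil   # replaces the default block with a plain default value
Marshal.dump(h)   # now works
h["uy"]           # => nil   (the block is gone)

On the next run the cached pairs are merged via Hash#update into a fresh hash that does have the default block, so the memoizing behaviour comes back for free.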


Some code

Here's a censored version of my script*1. It processes a 70MB access.log in 4 seconds when the results are cached.

loglines = ARGF.readlines
results = Hash.new{|h,k| h[k] = 0}
# Memoized IP -> country lookup, preloaded from the cache left by previous runs.
iptocountry = Hash.new do |h,ip|
  h[ip] = `geoiplookup #{ip}`.chomp.gsub(/^GeoIP Country Edition: /,"")
end
iptocountry.update Marshal.load(File.read("geo.cache")) rescue nil
# Take the client IP (first field) of each line and count one hit per unique IP.
loglines.map do |line|
  line[/\S+(?=\s)/]
end.uniq.each do |ip|
  results[iptocountry[ip]] += 1
end
total = iptocountry.size
results.sort_by{|k,v| v}.reverse_each{|k,v| puts "%25s %5d %5.2f%%" % [k,v,100.0*v/total]}
# Drop the default proc so the hash can be marshaled for the next run.
iptocountry.default = nil
File.open("geo.cache", "w"){|f| Marshal.dump(iptocountry, f)}

puts "=" * 80
puts "Total unique IPs: #{total}"


I find myself doing this a lot too.. - Maurice (2005-12-28 (Wed) 06:58:56)

So I made a simple wrapper class that uses method_missing to do something like this for arbitrary objects. It lets you do simple time-based caching, clear the cache, and define which methods are cacheable.

I like the use of the hash default block to do this.. I may update my class to do that as well.

I have it up on my projects page:

http://mauricecodik.com/projects/
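
This is not Maurice's actual caching.rb (that's at the URL above); it's just a rough sketch of the kind of method_missing wrapper he describes, with made-up names and without the time-based expiry:

class CachingProxy
  def initialize(target, cachable = nil)
    @target   = target
    @cachable = cachable                   # nil means "cache every method"
    @cache    = Hash.new{|h,m| h[m] = {}}  # one result table per method
  end

  def clear_cache
    @cache.clear
  end

  def method_missing(meth, *args, &block)
    if block || (@cachable && !@cachable.include?(meth))
      @target.send(meth, *args, &block)    # don't cache excluded methods or calls with blocks
    else
      @cache[meth].fetch(args){|a| @cache[meth][a] = @target.send(meth, *args)}
    end
  end
end

Note that this sketch keys each method's cache on args itself rather than on args.inspect, which is precisely the point raised in the reply below.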


mfp 2005-12-30 (Fri) 04:44:57

I've read your caching.rb and have a couple of questions. I'm not sure I see the point of having a timeout for cached values: you're not using it to remove old entries from the cache table (so it will keep growing monotonically), and my gut feeling is that time is too weak a guarantee for the validity of the computation. Which sorts of computations benefit from (or rather require, so that the results are correct) such a time-based caching approach? On a lower level... are you sure @cache[m][args.inspect] is preferable to @cache[m][args]? The latter relies on #hash and #eql? for each parameter, but the former feels fragile: some objects don't have a meaningful string representation (and you effectively end up doing something equivalent to @cache[m][args.map{|x| x.object_id}]).
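
To make that last point concrete, a small illustration (Point is a made-up value object that defines #hash and #eql? but keeps the default #inspect):

class Point
  attr_reader :x, :y
  def initialize(x, y); @x, @y = x, y; end
  # Hash keys only need #hash and #eql?
  def eql?(other); other.is_a?(Point) && x == other.x && y == other.y; end
  def hash; [x, y].hash; end
end

a = Point.new(1, 2)
b = Point.new(1, 2)
[a].eql?([b])            # => true: used as keys, [a] and [b] hit the same cache entry
a.inspect == b.inspect   # => false: the default #inspect embeds the object id,
                         #    so a cache keyed on args.inspect misses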


Daniel Berger 2005-12-31 (Sat) 14:56:19

Maybe I should add a cache to file option for the memoize package. BTW, the comment section on your entries could use some work.


mfp 2006-01-04 (Wed) 07:20:23

Working on it (have to take some protective measures against spam...)


*1 The original is named anal.rb and uses an anal.cache file. I'll probably rename the latter for the sake of clarity.