
Rethinking the Benchmark module

There's more to benchmarking than slapping your code inside

Benchmark.bm{|bm| bm.report("foo"){ ... } }

The first thing we instinctively do is wrap the code in TIMES.times with a large TIMES constant. Surely the "errors" cancel each other out for large enough values of TIMES, right? But how large is large enough?
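That is, something along these lines (just a sketch; foo stands for whatever code is actually being measured):

require 'benchmark'

TIMES = 100_000   # chosen by gut feeling
Benchmark.bm do |bm|
  bm.report("foo"){ TIMES.times { foo } }
end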

I wrote a simple AdaptativeBenchmark, which works more or less like the Benchmark module in stdlib, but also decides how many times to repeat execution in order to approach the average time with the desired precision for a given confidence level:

AdaptativeBenchmark.bm do |bm|
  # by default, approach the average with a 10% confidence interval for a 95%
  # confidence level
  bm.report("bar") { bar }
  bm.report("foo", :precision => 0.05, :confidence => 0.9) { foo }
end

The AdaptativeBenchmark first estimates the sample variance, population variance and population average by running the given block min_runs times (10 by default), and then uses those initial estimates to compute how many iterations are needed for the desired confidence interval and level. The initial number of runs should probably be adjusted the same way the extra ones are, but I'm not sure it's worth the effort.

class AdaptativeBenchmark
  def initialize(len, runs)
    @field_len = len
    @min_runs = runs
  end
  def self.bm(len = 10, min_runs = 10) 
    puts "#{" " * len}        \tstddev  \truns\tinterval\tconfidence"
    yield new(len, min_runs)
  end

  REPORT_OPT = {:precision => 0.1, :confidence => 0.95}
  def report(name, options = {})
    old_sync, $stdout.sync = $stdout.sync, true
    opt = REPORT_OPT.clone.update(options)
    sample_avg = sample_variance = 0

    tms_to_total = lambda{|tms| tms.utime + tms.stime } # user + system CPU time
    
    take_sample = lambda do |i|
      GC.start  # collect now, so a GC run is less likely to distort the sample
      t0 = tms_to_total[Process.times]
      yield
      exec_time = tms_to_total[Process.times] - t0
      new_sample_avg = sample_avg + (exec_time - sample_avg) / i
      if i == 1
        sample_avg = new_sample_avg
        next
      end
      # running maximum likelihood variance estimate; the N/(N-1) correction
      # is applied below when estimating the population variance
      sample_variance = (1 - 1.0/i) * sample_variance + 
                        (i - 1) * (new_sample_avg - sample_avg) ** 2
      sample_avg = new_sample_avg
    end
    
    (1..@min_runs).each{|i| take_sample[i] }

    population_variance = 1.0 * @min_runs / (@min_runs - 1) * sample_variance
    a = opt[:precision] * sample_avg
    # Chebyshev bound on the runs needed for the desired interval and confidence
    num_runs = (population_variance / (a ** 2 * (1 - opt[:confidence]))).ceil
    total_runs = @min_runs + num_runs
    (@min_runs+1..total_runs).each{|i| take_sample[i] }

    population_variance = 1.0 * total_runs / (total_runs - 1) * sample_variance
    puts "%-#{@field_len}s %8.6f\t%8.6f\t%-6d\t%8.6f\t%3d%%" % 
         [name, sample_avg, Math.sqrt(population_variance), total_runs,
          a, opt[:confidence] * 100]
  ensure
    $stdout.sync = old_sync
  end
end

The maths

In order to avoid storing all the sample values, the sample average is computed with the recurrence

$$\mu_{i+1} = \mu_i + \frac{x_{i+1} - \mu_i}{i+1}$$

and the (maximum likelihood) sample variance with

$$s_{i+1}^2 = \left(1 - \frac{1}{i+1}\right) s_i^2 + i \left(\mu_{i+1} - \mu_i\right)^2$$

the population variance being estimated as $\sigma^2 = \frac{N}{N-1} s_N^2$. The number of iterations follows from Chebyshev's inequality, and is taken as

$$n = \left\lceil \frac{\sigma^2}{(\alpha\mu)^2 (1 - P)} \right\rceil$$

where $\alpha$ is the desired confidence interval (as a fraction of the population average) and $P$ the confidence level, given as the :precision and :confidence parameters.
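Both can be checked against the basic definitions. The average update follows from

$$\mu_{i+1} = \frac{1}{i+1}\sum_{k=1}^{i+1} x_k = \frac{i\,\mu_i + x_{i+1}}{i+1} = \mu_i + \frac{x_{i+1} - \mu_i}{i+1}$$

and the iteration count from Chebyshev's inequality applied to the sample mean (whose variance is $\sigma^2/n$); requiring the failure probability to stay below $1-P$ gives

$$\Pr\left(|\bar{x}_n - \mu| \ge \alpha\mu\right) \le \frac{\sigma^2}{n\,(\alpha\mu)^2} \le 1 - P \quad\Longrightarrow\quad n \ge \frac{\sigma^2}{(\alpha\mu)^2 (1-P)}$$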

Limitations

As said above, the initial estimates determine how many extra runs are performed, so at times far too many will be deemed necessary: either because the initial variance estimate happened to be high for some reason, or because the desired confidence interval is very narrow (or the confidence level very high).
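A trivial safeguard would be capping the estimate; a sketch, assuming a :max_runs option that the code above does not actually accept:

# hypothetical :max_runs cap on the Chebyshev estimate
num_runs = [(population_variance / (a ** 2 * (1 - opt[:confidence]))).ceil,
            opt[:max_runs] || 10_000].min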

Computing the maximum likelihood estimates with the above recurrences only makes sense when there are lots of samples; otherwise, storing the values and applying the basic definitions directly is probably more practical.
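That is, something like this sketch, where run_times stands for an array of measured times:

run_times = [0.07, 0.072, 0.069]  # measured elsewhere
n = run_times.size
avg = run_times.inject(0.0){|s, x| s + x } / n
mle_variance = run_times.inject(0.0){|s, x| s + (x - avg) ** 2 } / n
population_variance = n / (n - 1.0) * mle_variance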

Example

TIMES = 100_000
AdaptativeBenchmark.bm(10) do |bm|
  a = 0
  $a = 0
  @a = 0
  bm.report("lvar"){ TIMES.times {|a| a; a; a; a; a} }
  bm.report("dvar_curr"){ TIMES.times{|b| b; b; b; b; b} }
  bm.report("dvar"){ 1.times{|b| 1.times{|c| 1.times{|d| TIMES.times{|b| b; b; b; b; b} } } } }
  bm.report("gvar"){ TIMES.times{|$a| $a; $a; $a; $a; $a} } 
  bm.report("ivar"){ TIMES.times{|@a| @a; @a; @a; @a; @a} }
  # don't really care about cvar, constant
end

           avg          stddev          runs    interval        confidence
lvar       0.070500     0.003436        19      0.007100         95%
dvar_curr  0.069048     0.003481        20      0.006800         95%
dvar       0.076000     0.005712        24      0.007600         95%
gvar       0.070000     0.000000        11      0.007000         95%
ivar       0.100714     0.003397        13      0.010100         95%

BenchUnit - Daniel Berger (2006-05-10 (Wed) 09:03:16)

I blogged about the idea of a BenchUnit way back at http://djberg96.livejournal.com/52734.html

Perhaps we can join these two ideas....

mfp 2006-05-10 (Wed) 09:52:57

Hmm that gives me another idea: testing non-functional requirements the same way we unit test. Imagine

 assert_complexity(:linear) do |n,runner|
   struct = FunkyStruct.new
   # ...
   runner.run { n.times{|i| struct.insert(foo(i)) } }
 end
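A very rough sketch of what the machinery could look like (everything below is made up for illustration: ComplexityRunner, the size list, and the crude ratio check):

 require 'benchmark'

 # made-up runner: times the block handed to #run
 class ComplexityRunner
   attr_reader :time
   def run
     @time = Benchmark.realtime { yield }
   end
 end

 # crude :linear check: time per element should stay roughly constant
 def assert_complexity(order, sizes = [1_000, 2_000, 4_000, 8_000])
   raise NotImplementedError, "only :linear sketched" unless order == :linear
   per_unit = sizes.map do |n|
     runner = ComplexityRunner.new
     yield n, runner
     runner.time / n
   end
   raise "not linear" if per_unit.max / per_unit.min > 2.0  # arbitrary tolerance
 end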

Daniel Berger 2006-05-11 (Thu) 08:31:54

I feel "ZenBench" coming on...

