Wide Finder 2: processing 42GB of httpd logs, 300X faster than naïve Ruby.
The Wide Finder 2 benchmark measures the speed at which a program can analyze 42GB worth of webserver logs and generate basic statistics (top URLs by hits, byte count and 404 errors, top clients and referrers) on a pretty beefy machine: a 8-core Sun Fire T2000 with 32 hardware threads and 32GB of RAM.
Several people have written a multitude of implementations in various languages, including C++, OCaml, C, Java, Scala, Groovy, Fan, Python, Perl, Ruby, and combinations thereof. The results are presented on this table.
Speedup
The reference implementation was written in Ruby and took over 25 hours to process the logs. The fastest versions, coded in C++ and OCaml, can do it in around 5 minutes, nearly 300 times faster.
In the case of OCaml, this speedup can be broken down into 3 factors.
Use of a more efficient language implementation: 1 order of magnitude
This does not necessarily cause an explosion in code size, as shown by the wf2_simple.ml semi-naïve port to OCaml, which is in fact shorter than the reference Ruby implementation by line count (this is far from being the case in C++, though), but runs an order of magnitude faster.
It is interesting to analyze the bang-per-buck ratio for the different languages: while two C++ implementations ultimately yielded the highest performance (a third one failed to parallelize properly, and was over three time slower than the top C++ or OCaml versions) , they do so at the cost of an explosion in the amount of code, requiring 2 to 3 times more lines than the OCaml version whose performance was within 10% of the best one, even though the C++ entries used libraries like Boost.
Parallelism: 1 order of magnitude
The Wide Finder 2 problem is fairly easy to decompose into smaller subproblems (essentially by having several workers processing different parts of the logs). The T2K is a bit particular in that each of its 32 hardware threads (which are seen as 32 CPUs when you program) is much less powerful than the x64 cores we're using most of the time, so parallel execution is a must.
I parallelized the OCaml program trivially by using a semi-standard 15-line higher-order function which accepts a function and executes it in another process.
Small-scale optimizations that apply to both the single-threaded and the parallel cases: 2-4X
The Wide Finder 2 implies lots of IO activity, which proved to be relatively hard to optimize on the T2K, because you can easily saturate a core by doing mere IO (i.e., a single core is barely able to cope with the sustained read rate of the disk). While my first OCaml implementations used line-oriented IO, like the reference version in Ruby, the fastest one (like most other entries) uses block-oriented IO, which accounted for a large part of the linecount increase.
Conclusions
Optimization potential
There's as much to gain from parallelism as from more efficient single-core execution at this stage. The fastest programs are roughly 300X faster than the reference implementation in Ruby, and need 15 to 30 times less CPU time, with a gain attributable to parallelism ranging from 8X to 17X --- less than the speedup in serial execution. This is remarkable because the Wide Finder 2 benchmark involves relatively little processing, and the most expensive operations in the Ruby implementation use core functionality written in C. Ruby is so slow at the remaining 10-30% (by CPU time) of the work (which OTOH represents most of the program) that the end result is still an order of magnitude slower even though much time is spent in relatively optimized C code.
In fact, at least in the OCaml case, most of the engineering effort goes into optimizations that apply to both the single- and the multi-core cases. In the OCaml programs, parallelism imposes a minimal overhead; this is shown clearly in wf2_multicore2_block_small.ml, where the "parallel tax" is around 5 lines of code. There is no shared data, no locks, and none of the complexity associated to threads.
Nature of the Wide Finder 2 benchmark
The processing speed is ultimately determined by the IO subsystem. Only C++ and OCaml were able to reach that bottleneck. Note the relatively large difference in user CPU time between the top three entries, which all run in ~5 minutes clock time nevertheless. The entry taking the least CPU time (15% less than the closest one) is not the fastest (it's 7% slower than said entry). The OCaml entry is only 10% slower, even though it takes over twice as much CPU time! When I modified wf2_multicore2_block.ml to optimize a single function representing around half of the CPU time (the one that splits a line), thus putting the CPU time below 1 hour (i.e., comparable to the fastest C++ implementations), I observed no noticeable difference in the final execution speed (I didn't add these results to the table for that reason). This shows that the bottleneck is indeed IO.
a bit misleading - Lionel Barret de Nazaris (2008-10-28 (Mardi) 00:54:49)
in the table, the second best (in python) in 100x faster than ruby. the ruby data point is very extreme and ,from a data analysis pov, quite irrelevant. it is a kind of anomaly (sp?). (Note : I have nothing against ruby, I am talking from a data analysis pov.)
it only prooves that the ruby interpreter is very slow and should not be in the same category/chart.
Remove ruby (with the information that it was removed) and you have a much better distribution you can work on.
mfp 2008-10-29 (Wed) 13:31:39
I think the reference implementation in Ruby is an important data point: it shows the performance you can expect from a straightforward, non-parallel implementation in an interpreted language for a task it should be quite good at: string processing & simple analysis.
oi - Heh (2008-10-28 (Tue) 01:04:50)
What about disk write speed? 42GB / 5 minutes = 143.36 MBps. With a faster hard drive it won't be a bottleneck.
mfp 2008-10-29 (Wed) 13:39:25
The goal of the Wide Finder 2 benchmark is to see how fast the task can be solved for the given hardware without excessive complexity, not to change the hardware ;)
Here's some speculation: if IO were faster, the gap between C++ / OCaml and the other languages would increase substantially. See for instance how the top Scala version takes over 7 times as much CPU time as the top entries, but is only 3x slower. If IO were not the bottleneck, we'd probably see C++ and OCaml processing the 42GB in less than 3 minutes.
optimised Ruby? - Shot (Piotr Szotkowski) (2008-10-28 (Tue) 06:36:03)
It would be most interesting to see how much can a Ruby solution gain by using (a) naïve parallelism with the Forkoff gem and (b) some RubyInline love.
mfp 2008-10-29 (Wed) 13:48:02
I'd expect the same gains thanks to parallelism as in the case of the Python implementations (a factor of 20x to 25x).
I'm not sure RubyInline (or C extensions written by hand) would result in that large a gain, however. Method dispatching is exceedingly slow in Ruby (even in the case of a method cache hit), so calling a method implemented in C for each line could be quite expensive. That is, the overhead introduced by method dispatching in Ruby is significant. We'd thus be tempted to process whole blocks in a single method call (instead of a single line), but at that point the program is essentially written in C, with just a little Ruby around. The end result would be as complex as the C++ implementations, but slower...
Block I/O - Fred (2008-10-28 (Tue) 12:37:08)
Ruby can be made to read blocks instead of lines. Most line reading implementation copy over the line into a string; This should be avoided for big jobs.
What version numbers for each language processor?
mfp 2008-10-29 (Wed) 14:16:50
It should indeed be possible to make the Ruby version maybe 2-3 times faster by optimizing with some care. If you keep everything in Ruby, block-based IO would involve less copying and decrease greatly the amount of GC work. However, extra code (in Ruby) would be needed. This matters a great deal because (as it often happens in Ruby), you have to pick between a suboptimal choice implemented in C (core class) and a better solution in Ruby. Since any amount of pure Ruby code often carries a 100x slowdown (interpretative overhead, slow dispatching, etc.), it is hard to make up for the speed loss when compared to the core classes written in C, even when you have algorithmic (complexity) improvements. This is not the case in e.g. OCaml, where the libraries are also written in OCaml and everything happens at C-like speed anyway: you can easily craft code that outperforms the standard/core libraries by suiting it to the problem at hand.
Looking around in the T2K machine that ran everything, it seems to be using Python 2.5.2, GCC 3.4.3, OCaml 3.10.2, Ruby 1.8.6, Java 1.6.0_06-b02, perl 5.8.8. I don't know about the other languages (I suppose the people who used Scala, Fan or Groovy picked the latest versions because they had to install them themselvers).
Keyword(s):[blog] [ocaml] [frontpage] [widefinder] [ruby] [parallel] [optimization]
References: