eigenclass logo
MAIN  Index  Search  Changes  PageRank  Login

MAIN

eigenclass is migrating to a new engine, find the latest contents here

What's eigenclass.org?


2008-10-27 23:20 UTC Wide Finder 2: processing 42GB of httpd logs, 300X faster than naïve Ruby.

The Wide Finder 2 benchmark measures the speed at which a program can analyze 42GB worth of webserver logs and generate basic statistics (top URLs by hits, byte count and 404 errors, top clients and referrers) on a pretty beefy machine: a 8-core Sun Fire T2000 with 32 hardware threads and 32GB of RAM.

Several people have written a multitude of implementations in various languages, including C++, OCaml, C, Java, Scala, Groovy, Fan, Python, Perl, Ruby, and combinations thereof. The results are presented on this table.

Speedup

The reference implementation was written in Ruby and took over 25 hours to process the logs. The fastest versions, coded in C++ and OCaml, can do it in around 5 minutes, nearly 300 times faster.

In the case of OCaml, this speedup can be broken down into 3 factors.

Use of a more efficient language implementation: 1 order of magnitude

This does not necessarily cause an explosion in code size, as shown by the wf2_simple.ml semi-naïve port to OCaml, which is in fact shorter than the reference Ruby implementation by line count (this is far from being the case in C++, though), but runs an order of magnitude faster.

It is interesting to analyze the bang-per-buck ratio for the different languages: while two C++ implementations ultimately yielded the highest performance (a third one failed to parallelize properly, and was over three time slower than the top C++ or OCaml versions) , they do so at the cost of an explosion in the amount of code, requiring 2 to 3 times more lines than the OCaml version whose performance was within 10% of the best one, even though the C++ entries used libraries like Boost.

Parallelism: 1 order of magnitude

The Wide Finder 2 problem is fairly easy to decompose into smaller subproblems (essentially by having several workers processing different parts of the logs). The T2K is a bit particular in that each of its 32 hardware threads (which are seen as 32 CPUs when you program) is much less powerful than the x64 cores we're using most of the time, so parallel execution is a must.

I parallelized the OCaml program trivially by using a semi-standard 15-line higher-order function which accepts a function and executes it in another process.

Small-scale optimizations that apply to both the single-threaded and the parallel cases: 2-4X

The Wide Finder 2 implies lots of IO activity, which proved to be relatively hard to optimize on the T2K, because you can easily saturate a core by doing mere IO (i.e., a single core is barely able to cope with the sustained read rate of the disk). While my first OCaml implementations used line-oriented IO, like the reference version in Ruby, the fastest one (like most other entries) uses block-oriented IO, which accounted for a large part of the linecount increase.

Conclusions

Optimization potential


Read more...
Last modified:2005/11/08 03:28:40
Keyword(s):
References:[Opening up my hiki wiki: bliki.rb plugin]