MAIN
2008-10-27 23:20 UTC Wide Finder 2: processing 42GB of httpd logs, 300X faster than naïve Ruby.
The Wide Finder 2 benchmark measures the speed at which a program can analyze 42GB worth of webserver logs and generate basic statistics (top URLs by hits, byte count and 404 errors, top clients and referrers) on a pretty beefy machine: a 8-core Sun Fire T2000 with 32 hardware threads and 32GB of RAM.
Several people have written a multitude of implementations in various languages, including C++, OCaml, C, Java, Scala, Groovy, Fan, Python, Perl, Ruby, and combinations thereof. The results are presented on this table.
Speedup
The reference implementation was written in Ruby and took over 25 hours to process the logs. The fastest versions, coded in C++ and OCaml, can do it in around 5 minutes, nearly 300 times faster.
In the case of OCaml, this speedup can be broken down into 3 factors.
Use of a more efficient language implementation: 1 order of magnitude
This does not necessarily cause an explosion in code size, as shown by the wf2_simple.ml semi-naïve port to OCaml, which is in fact shorter than the reference Ruby implementation by line count (this is far from being the case in C++, though), but runs an order of magnitude faster.
It is interesting to analyze the bang-per-buck ratio for the different languages: while two C++ implementations ultimately yielded the highest performance (a third one failed to parallelize properly, and was over three time slower than the top C++ or OCaml versions) , they do so at the cost of an explosion in the amount of code, requiring 2 to 3 times more lines than the OCaml version whose performance was within 10% of the best one, even though the C++ entries used libraries like Boost.
Parallelism: 1 order of magnitude
The Wide Finder 2 problem is fairly easy to decompose into smaller subproblems (essentially by having several workers processing different parts of the logs). The T2K is a bit particular in that each of its 32 hardware threads (which are seen as 32 CPUs when you program) is much less powerful than the x64 cores we're using most of the time, so parallel execution is a must.
I parallelized the OCaml program trivially by using a semi-standard 15-line higher-order function which accepts a function and executes it in another process.
Small-scale optimizations that apply to both the single-threaded and the parallel cases: 2-4X
The Wide Finder 2 implies lots of IO activity, which proved to be relatively hard to optimize on the T2K, because you can easily saturate a core by doing mere IO (i.e., a single core is barely able to cope with the sustained read rate of the disk). While my first OCaml implementations used line-oriented IO, like the reference version in Ruby, the fastest one (like most other entries) uses block-oriented IO, which accounted for a large part of the linecount increase.
Conclusions
Optimization potential
Read more...
2008-09-01 10:15 UTC Elo ratings for the Benchmarks Game (aka Great Computer Language Shootout)
The geometric mean, as used by the Computer Language Benchmarks Game, gives too much importance to outliers and results in unstable rankings that do not reflect overall performance consistency well. The Elo rating system can be applied to the same results to obtain a better ranking where the influence of small speed perturbations is minimized and all-round performance is favored.
Even though the Great Computer Language Shootout is now called the Computer Language Benchmarks Game, its results are often used incorrectly to compare language implementations (and even worse, this is generalized to languages as well) performance-wise by merely comparing the scores without any further thought. Many of the problems stem from the choice of the geometric mean of the execution time compared to the fastest entry for each benchmark and implementation. More on that below.
If it's a game, why not use an actual rating system like those used in chess or go? I use the Elo rating system. A win (see below for the definition of what represents a win) is worth one point, a draw 0.5 and a loss 0, and the expected score of program A against program B is

where
represents the "strength" (current rating) of
the language implementation "X".
All the possible pairings of benchmark implementations are considered, and the ratings are updated at the end of each round (this is essential to avoid dependence on the order of the pairings) with

(K is a suitably small coefficient to avoid oscillations.)
By choosing the scoring function carefully, the resulting ratings admit an interesting interpretation and allow to draw more meaningful conclusions than the geometric mean.
Here follow the Elo ratings for the x64 Ubuntu (Intel® Q6600® quad-core) data as of 2008-09-01, where a program is considered the winner when it runs in less than half as much time as its opponent. (The programs used to generate it and the justification of this scoring function are given below.)
gpp 1405
gcc 1397
ocaml 1336
mlton 1317
java 1307
ghc 1301
fpascal 1271
cal 1271
scala 1261
gnat 1252
nice 1246
csharp 1199
hipe 952
mzscheme 928
perl 855
python 792
lua 791
pike 767
javaxint 763
yap 728
groovy 687
tcl 624
gst 607
php 532
javascript 410
This ranking includes languages omitted in the original one and the relative order differs a bit. This is because this rating reflects something different, arguably more interesting.
Source code
I use a small Ruby script (scrap.rb) to screen scrap the shootout site and dump the data in JSON format that will look like
{
"pidigits": {
"mzscheme": 28.27,
"hipe": 17.04,
...
},
"fannkuch": {
...
},
...
}
It is then processed by this OCaml program:
open Printf module H = ExtHashtbl.Hashtbl module L = ExtList.List let (|>) x f = f x let init_rating = 1000. let k = 5. let rounds = 100 let win_ratio = 1.0 /. 2.0 let actual_score ta tb = if ta <= win_ratio *. tb then 1. else if tb <= win_ratio *. ta then 0. else 0.5 let expected_score ra rb = 1. /. (1. +. 10. ** ((rb -. ra) /. 500.)) let new_ratings ra rb ta tb = let d = actual_score ta tb -. expected_score ra rb in (ra +. k *. d, rb -. k *. d) let rating_deltas ra rb ta tb = let r1, r2 = new_ratings ra rb ta tb in (r1 -. ra, r2 -. rb) type json shootout_data = (string * (string * float) assoc) assoc let get t lang = H.find_default t lang init_rating let inc t v k delta = H.replace t k (H.find_default t k v +. delta) let run data = let ratings = H.create 13 in let do_match deltas (lang1, t1) (lang2, t2) = if lang1 <> lang2 then let d1, d2 = rating_deltas (get ratings lang1) (get ratings lang2) t1 t2 in inc deltas 0. lang1 d1; inc deltas 0. lang2 d2 in let round () = let ds = H.create 13 in L.iter (fun (_, rs) -> L.iter (fun l1 -> L.iter (do_match ds l1) rs) rs) data; H.iter (inc ratings init_rating) ds in for i = 1 to rounds do round () done; H.fold (fun lang r l -> (r, lang) :: l) ratings [] |> L.sort |> L.rev |> L.iter (fun (rating, lang) -> printf "%16s %5.0f\n" lang rating) let () = match Array.length Sys.argv with 2 -> Json_io.load_json Sys.argv.(1) |> shootout_data_of_json |> run | _ -> printf "elo_rating <file>\n"; exit 0
The only remarkable thing is the use of the json-static syntax extension to load the JSON data type-safely with only one line of code which declares the type and generates the shootout_data_of_json function used to verify and load JSON data:
type json shootout_data = (string * (string * float) assoc) assoc
Deficiencies of the geometric mean and justification of the scoring function
Read more...
2008-08-27 15:29 UTC Some functional programming and OCaml koans
let rec
One day, a disciple of another sect came to Xavier Leroy and said mockingly:
"The OCaml compiler seems very limited: why do you have to indicate when a function is recursive, cannot the compiler infer it?"
Xavier paused for a second, and replied patiently with the following story:
"One day, a disciple of another sect came to Xavier Leroy and said mockingly..."
Polymorphic answers
A novice came to Jacques Garrigue and spoke nervously: "I don't get rank-2 polymorphism. What is it good for? When to use it? How can I understand it?". Jacques asked: "Do you want an answer to each question, or the answer to all your questions?." The novice was enlightened.
Leroy instructs a student
The venerable master Leroy was walking with his student. Wishing to start a discussion with his master, the apprentice said "Master, I've heard that all loops must be replaced with tail-recursive functions. Is that true?" Leroy looked commiseratively at his student and replied "Foolish pupil, many tail-recursive functions are merely inefficient loops."
The student spent the next few weeks replacing tail-recursive functions with explicit loops. He finally showed his code to master Leroy, seeking his approval. Leroy hit him with a stick. "When will you learn? Explicit loops are a poor man's tail-recursive functions." At that moment, the student became enlightened.
Complete solutions
Anton had been studying the dining philosophers problem and was ecstatic when he presented a novel solution to Damien Doligez. The great master did not show the same enthusiasm, though: all he said was "the solution is not complete". Anton knew that Doligez had written a machine-checked proof of correctness for the concurrent GC he developed in the early 90s for Caml Light, and began to work. After struggling with Coq, he came back to Doligez.
"I have proved that my solution prevents starvation and livelock."
Again, Doligez replied: "the solution is not complete".
Anton retorted: "My correctness proof has been verified mechanically. Moreover, my simulations show that my method allows for a large degree of concurrency, and it's easy to implement efficiently. Surely the solution is complete now?"
The Zen master then explained: "You claim to have solved the dining philosophers problem, yet you haven't expounded where the spaghetti come from."
Pierre Weis and the variance annotations
Read more...
2008-07-17 11:38 UTC Reexamining qsort, eager vs. lazy algorithm analysis and Ruby's (and other's) GC
Yesterday's post on The Comonad.Reader referred to my analysis of the list-based "quicksort" and described a faster function based on difference lists. It gave an example that troubled me for a while, and added that "the author [referring to me] was counting conses, unfortunately that isn't a valid metric":
reverse (a:as) = reverse as ++ [a] reverse [] = []
This function exhibits
performance, even though
there seems to be one cons per level; this would show that just counting
conses isn't enough, and invalidate my analysis. The catch is that
there is not only one cons per level, and one must be careful and not
forget the conses implied by the list concatenation operation. In my analysis
of qsort, the number of conses verified

with
, where the final
term represents the conses needed to concatenate the sorted sublists.
If the above function is analyzed the same way I did with qsort, the number of conses satisfies

which clearly implies
.
The key term is
, which represents the conses
associated to the concatenation, the same way
did in
qsort's expression; had it not been there, the function would be
. So, while this example doesn't invalidate my proof,
it serves as a good reminder of the cost of list concatenation and how easily
it turns linear-time algorithms into
.
(Incidentally, it turns out that qsort would be
even without the extra term caused by list concatenation, so the proof, albeit
incorrect, would have reached the correct conclusion.)
Strict vs. non-strict, eager vs. lazy analysis
While it seems the original analysis was sound, a new question was raised indirectly when Edward Kmett wrote that
it's not the number of times you cons that matters, but the number of thunks that have to be forced to get to the answer
I've shown above that this is only true if you aren't careful when counting conses, but the problem it suggests deserves to be considered by itself: can the analysis under a strict semantics be taken as an upper bound of the complexity in the non-strict or lazy cases?
I always assumed that the answer was a straight "yes". Intuitively, under lazy evaluation you have thunks with the data necessary to evaluate their associated expressions only when needed, so surely at most as much work as if they were always evaluated is performed, right?
After pondering about this longly, I've found a case where this wouldn't be true, suggested directly by an alternative definition of non-strict semantics:
Non-strict semantics means that a function can have a definite value although its argument is undefined.
What if an algorithm depended on an undefined value for termination? In practice, this would mean that it relies on an expression raising an exception to stop the computation and return some result after recovery. Whereas in the strict case the expression would be evaluated unconditionally and the computation would terminate, with a non-strict semantics it would be left unevaluated if not needed, thus failing to raise the exception and finish in a timely manner.
Edit (2008-07-18): There's an example below for an impure language. Anyway impurity + laziness is a nonsensical combination and the example represents 2 different programs in one that happen to give the same answer, so I'll be happy to analyze pure programs and assume that "lazy evaluation always needs less reduction steps than eager evaluation" to obtain upper bounds (thanks apfelmus).
The above case is contrived, and such a program would be needlessly complex. In fact, in OCaml it would be expressed awkwardly with a ref, and in Haskell pattern matching over an Either value (implicit in bind) would normally force evaluation, thus behaving as the strict counterpart.
Barring this exception, it seems to me that analyzing under a strict semantics remains a good way to get an upper bound for the non-strict case, but lacking a formal proof (and even with one) you can never be too sure. Any pointers?
Something relevant might be found by following the trail given by this paper by P. Wadler, whose abstract states the following:
In a strict functional language, analysis of time complexity is straightforward, because of the following compositional rule: The time to evaluate (ƒ(g x)) equals the time to evaluate (g x) plus the time to evaluate (ƒ y), where y is the value of (g x). However, in a lazy language, this rule only gives an upper bound, possibly a crude one.
I haven't read the paper yet nor followed its references/citations to see if there's a proof for that. The problem I can foresee is that the counterexample I came up with might not happen in the simplified languages/computational models used in most papers.
At any rate, it seems that, for all but a few contrived programs (relying on "useless" expressions being evaluated to raise an exception and terminate), simple "eager" analysis remains a good way to obtain an upper bound for the non-strict case. If it is low enough, we can stop there and enjoy.
Superlinear GC work: why Ruby scales badly, and a possible explanation of qsort's behavior?
Read more...
2008-07-15 10:52 UTC Quicksort erratum
A few days ago, I called the following Haskell function, reminiscent of
Quicksort and considered the epitome of beautiful code by many, unusable for
taking "always
space and time":
qsort [] = [] qsort (x:xs) = qsort (filter (< x) xs) ++ [x] ++ qsort (filter (>= x) xs)
Moreover, I referred to this version (using a C-like DSEL in Haskell)
due to Lennart Augustsson
as a usable one (as opposed to a
algorithm):
qsortM :: forall i a m r arr .
(MonadRef m r, Num i, Ix i, MArray arr a m, Ord a Bool) =>
arr i a -> m (arr i a)
qsortM ma = runE $ do
(lb, ub) <- embed $ getBounds ma
let mlb = pure0 lb
mub = pure0 ub
a <- liftArray ma
let qsort' l r =
if1 (r > l) $ do
i <- auto l
j <- auto (r+1)
let v = a[l] :: E m a
iLTj = i < (j :: E m i)
while iLTj $ do
while ((i += 1) < mub && a[i] < v)
skip
while (a[j -= 1] > v)
skip
if1 iLTj $ do
a[i] =:= a[j]
a[l] =:= a[j]
qsort' l (j-1)
qsort' i r
qsort' mlb mub
return ma
I was wrong. Instead of analyzing it with some care, I believed the
claims I read long ago in some now forgotten blog. I just saw the list
concatenations and reacted without thinking.
Finding an upper bound for the number of comparisons in the best case is easy. In fact, while I'am at it, I'll find a bound for the number of cons operations, which is clearly greater than that of comparisons.
Noting the number of conses C(n) for a list of size n, for n > 3 (there are a
few irregularities for n = 1 to 3 because no cons is needed when appending or
prepending [] to another list),
, where the problem for a list
of length n has been divided into two halves (lists of length p and q) plus a
lone pivot element. p (resp. q) conses are needed to build the list passed to the
recursive call to qsort, C(p) (C(q)) conses are needed for the subproblem, and
finally p + 1 conses are needed to concatenate the resulting sorted sublists.
The problem is how to find p and q. For the best case analysis,
. The n list can be divided into
the desired three parts as follows:


Therefore,

(There are two choices for the number of conses in the final list concatenation, differing by 1. The above is a conservative one, which will give an upper bound.)
Even after referring to Graham & Knuth & Patashnik, I saw no immediate way to solve this recurrence (please drop a comment if you have a closed-form solution). Fortunately, I just need an upper bound so I can use

which is easily proved by induction, since all I've done is to remove the floor operations.
Then, using the master theorem


(Note the use of
for C(n) and
for
D(n).)
The above program thus uses at most
space and
time. The upper bound is satisfactory enough; in fact, it's the same as for
the real quicksort, and, furthermore, the lowest achievable bound for a
comparison sort.
So far, I've assumed that the program evaluates everything eagerly, which might not be the case in Haskell, as it has a non-strict semantics (which is not exactly the same as being lazy). Let's see if GHC is really smart and can recognize that qsort(l) is just a permutation of l to skip the sort altogether when some operation that doesn't depend on the order is performed (this is quite close to proving that the qsort function actually sorts the list, and would be an amazing exploit indeed).
To that end, I first rewrite the function in OCaml; if GHC is able to do less work thanks to its non-strict semantics, the solution will scale better than with OCaml:
let rec qsort = function [] -> [] | x::xs -> qsort (List.filter ((>) x) xs) @ [x] @ qsort (List.filter ((<=) x) xs)
It's very similar to the Haskell one; the main difference is in the signs used with List.filter: ((>) x) means "x is greater than _", as opposed to "_ is greater than x" in Haskell.
The main function just creates a list of random integers, sums them up, and finally adds the numbers from the sorted list --- obtaining the same number, I hope (the sums are printed to stdout to make sure that computation is not omitted).
Here are the times I get:
The GHC binary takes nearly 800MB to sort a 3 million list, which explains why it suffers as n increases.
Well, this isn't... what was it, GHC 11? Being lazy does pay off if you try to get the nth (and in particular the first) element, though; using
Read more...
- 1740 http://redhanded.hobix.com
- 1321 http://www.rubyinside.com
- 694 http://planetruby.0x42.net
- 421 http://www.rubyweeklynews.org
- 365 http://jp.rubyist.net/magazine/?RubyKaigi2006-0610-2
- 319 http://redhanded.hobix.com/cult/variousPresents.html
- 287 http://programming.reddit.com/info/5znh8/comments
- 275 http://en.wikipedia.org/wiki/Metaprogramming
- 268 http://www.onlinebizbuzz.com
- 260 http://rubyinside.com
Keyword(s):
References:[Opening up my hiki wiki: bliki.rb plugin]