eigenclass logo
MAIN  Index  Search  Changes  PageRank  Login

MAIN

What's eigenclass.org?


2008-10-27 23:20 UTC Wide Finder 2: processing 42GB of httpd logs, 300X faster than naïve Ruby.

The Wide Finder 2 benchmark measures the speed at which a program can analyze 42GB worth of webserver logs and generate basic statistics (top URLs by hits, byte count and 404 errors, top clients and referrers) on a pretty beefy machine: a 8-core Sun Fire T2000 with 32 hardware threads and 32GB of RAM.

Several people have written a multitude of implementations in various languages, including C++, OCaml, C, Java, Scala, Groovy, Fan, Python, Perl, Ruby, and combinations thereof. The results are presented on this table.

Speedup

The reference implementation was written in Ruby and took over 25 hours to process the logs. The fastest versions, coded in C++ and OCaml, can do it in around 5 minutes, nearly 300 times faster.

In the case of OCaml, this speedup can be broken down into 3 factors.

Use of a more efficient language implementation: 1 order of magnitude

This does not necessarily cause an explosion in code size, as shown by the wf2_simple.ml semi-naïve port to OCaml, which is in fact shorter than the reference Ruby implementation by line count (this is far from being the case in C++, though), but runs an order of magnitude faster.

It is interesting to analyze the bang-per-buck ratio for the different languages: while two C++ implementations ultimately yielded the highest performance (a third one failed to parallelize properly, and was over three time slower than the top C++ or OCaml versions) , they do so at the cost of an explosion in the amount of code, requiring 2 to 3 times more lines than the OCaml version whose performance was within 10% of the best one, even though the C++ entries used libraries like Boost.

Parallelism: 1 order of magnitude

The Wide Finder 2 problem is fairly easy to decompose into smaller subproblems (essentially by having several workers processing different parts of the logs). The T2K is a bit particular in that each of its 32 hardware threads (which are seen as 32 CPUs when you program) is much less powerful than the x64 cores we're using most of the time, so parallel execution is a must.

I parallelized the OCaml program trivially by using a semi-standard 15-line higher-order function which accepts a function and executes it in another process.

Small-scale optimizations that apply to both the single-threaded and the parallel cases: 2-4X

The Wide Finder 2 implies lots of IO activity, which proved to be relatively hard to optimize on the T2K, because you can easily saturate a core by doing mere IO (i.e., a single core is barely able to cope with the sustained read rate of the disk). While my first OCaml implementations used line-oriented IO, like the reference version in Ruby, the fastest one (like most other entries) uses block-oriented IO, which accounted for a large part of the linecount increase.

Conclusions

Optimization potential


Read more...

2008-09-01 10:15 UTC Elo ratings for the Benchmarks Game (aka Great Computer Language Shootout)

The geometric mean, as used by the Computer Language Benchmarks Game, gives too much importance to outliers and results in unstable rankings that do not reflect overall performance consistency well. The Elo rating system can be applied to the same results to obtain a better ranking where the influence of small speed perturbations is minimized and all-round performance is favored.

Even though the Great Computer Language Shootout is now called the Computer Language Benchmarks Game, its results are often used incorrectly to compare language implementations (and even worse, this is generalized to languages as well) performance-wise by merely comparing the scores without any further thought. Many of the problems stem from the choice of the geometric mean of the execution time compared to the fastest entry for each benchmark and implementation. More on that below.

If it's a game, why not use an actual rating system like those used in chess or go? I use the Elo rating system. A win (see below for the definition of what represents a win) is worth one point, a draw 0.5 and a loss 0, and the expected score of program A against program B is

1 over { 1 + 10 sup { (R sub B - R sub A ) / 500 } }

where R sub X represents the "strength" (current rating) of the language implementation "X".

All the possible pairings of benchmark implementations are considered, and the ratings are updated at the end of each round (this is essential to avoid dependence on the order of the pairings) with

R' sub A  = R sub A + sum from benchmarks { K ( ActualScore sub A - ExpectedScore sub A ) }

(K is a suitably small coefficient to avoid oscillations.)

By choosing the scoring function carefully, the resulting ratings admit an interesting interpretation and allow to draw more meaningful conclusions than the geometric mean.

Here follow the Elo ratings for the x64 Ubuntu (Intel® Q6600® quad-core) data as of 2008-09-01, where a program is considered the winner when it runs in less than half as much time as its opponent. (The programs used to generate it and the justification of this scoring function are given below.)

            gpp   1405
            gcc   1397
          ocaml   1336
          mlton   1317
           java   1307
            ghc   1301
        fpascal   1271
            cal   1271
          scala   1261
           gnat   1252
           nice   1246
         csharp   1199
           hipe    952
       mzscheme    928
           perl    855
         python    792
            lua    791
           pike    767
       javaxint    763
            yap    728
         groovy    687
            tcl    624
            gst    607
            php    532
     javascript    410

This ranking includes languages omitted in the original one and the relative order differs a bit. This is because this rating reflects something different, arguably more interesting.

Source code

I use a small Ruby script (scrap.rb) to screen scrap the shootout site and dump the data in JSON format that will look like

 {
   "pidigits": {
     "mzscheme": 28.27,
     "hipe": 17.04,
     ...
   },
   "fannkuch": {
     ...
   },
   ...
 }

It is then processed by this OCaml program:

open Printf
module H = ExtHashtbl.Hashtbl
module L = ExtList.List
let (|>) x f = f x

let init_rating = 1000.
let k = 5.
let rounds = 100
let win_ratio = 1.0 /. 2.0

let actual_score ta tb =
  if ta <= win_ratio *. tb then 1.
  else if tb <= win_ratio *. ta then 0.
  else 0.5

let expected_score ra rb = 1. /. (1. +. 10. ** ((rb -. ra) /. 500.))

let new_ratings ra rb ta tb =
  let d = actual_score ta tb -. expected_score ra rb in
    (ra +. k *. d, rb -. k *. d)

let rating_deltas ra rb ta tb =
  let r1, r2 = new_ratings ra rb ta tb in
    (r1 -. ra, r2 -. rb)

type json shootout_data = (string * (string * float) assoc) assoc

let get t lang = H.find_default t lang init_rating
let inc t v k delta = H.replace t k (H.find_default t k v +. delta)

let run data =
  let ratings = H.create 13 in
  let do_match deltas (lang1, t1) (lang2, t2) =
    if lang1 <> lang2 then
      let d1, d2 = rating_deltas (get ratings lang1) (get ratings lang2) t1 t2 in
        inc deltas 0. lang1 d1;
        inc deltas 0. lang2 d2 in
  let round () =
    let ds = H.create 13 in
      L.iter (fun (_, rs) -> L.iter (fun l1 -> L.iter (do_match ds l1) rs) rs) data;
      H.iter (inc ratings init_rating) ds
  in
    for i = 1 to rounds do round () done;
    H.fold (fun lang r l -> (r, lang) :: l) ratings [] |> L.sort |> L.rev |>
    L.iter (fun (rating, lang) -> printf "%16s  %5.0f\n" lang rating)

let () = match Array.length Sys.argv with
    2 -> Json_io.load_json Sys.argv.(1) |> shootout_data_of_json |> run
  | _ -> printf "elo_rating <file>\n";
         exit 0

The only remarkable thing is the use of the json-static syntax extension to load the JSON data type-safely with only one line of code which declares the type and generates the shootout_data_of_json function used to verify and load JSON data:

 type json shootout_data = (string * (string * float) assoc) assoc

Deficiencies of the geometric mean and justification of the scoring function


Read more...

2008-08-27 15:29 UTC Some functional programming and OCaml koans

let rec

One day, a disciple of another sect came to Xavier Leroy and said mockingly:

"The OCaml compiler seems very limited: why do you have to indicate when a function is recursive, cannot the compiler infer it?"

Xavier paused for a second, and replied patiently with the following story:

"One day, a disciple of another sect came to Xavier Leroy and said mockingly..."

Polymorphic answers

A novice came to Jacques Garrigue and spoke nervously: "I don't get rank-2 polymorphism. What is it good for? When to use it? How can I understand it?". Jacques asked: "Do you want an answer to each question, or the answer to all your questions?." The novice was enlightened.

Leroy instructs a student

The venerable master Leroy was walking with his student. Wishing to start a discussion with his master, the apprentice said "Master, I've heard that all loops must be replaced with tail-recursive functions. Is that true?" Leroy looked commiseratively at his student and replied "Foolish pupil, many tail-recursive functions are merely inefficient loops."

The student spent the next few weeks replacing tail-recursive functions with explicit loops. He finally showed his code to master Leroy, seeking his approval. Leroy hit him with a stick. "When will you learn? Explicit loops are a poor man's tail-recursive functions." At that moment, the student became enlightened.

Complete solutions

Anton had been studying the dining philosophers problem and was ecstatic when he presented a novel solution to Damien Doligez. The great master did not show the same enthusiasm, though: all he said was "the solution is not complete". Anton knew that Doligez had written a machine-checked proof of correctness for the concurrent GC he developed in the early 90s for Caml Light, and began to work. After struggling with Coq, he came back to Doligez.

"I have proved that my solution prevents starvation and livelock."

Again, Doligez replied: "the solution is not complete".

Anton retorted: "My correctness proof has been verified mechanically. Moreover, my simulations show that my method allows for a large degree of concurrency, and it's easy to implement efficiently. Surely the solution is complete now?"

The Zen master then explained: "You claim to have solved the dining philosophers problem, yet you haven't expounded where the spaghetti come from."

Pierre Weis and the variance annotations


Read more...

2008-07-17 11:38 UTC Reexamining qsort, eager vs. lazy algorithm analysis and Ruby's (and other's) GC

Yesterday's post on The Comonad.Reader referred to my analysis of the list-based "quicksort" and described a faster function based on difference lists. It gave an example that troubled me for a while, and added that "the author [referring to me] was counting conses, unfortunately that isn't a valid metric":

   reverse (a:as) = reverse as ++ [a]
   reverse [] = []

This function exhibits O( n sup 2 ) performance, even though there seems to be one cons per level; this would show that just counting conses isn't enough, and invalidate my analysis. The catch is that there is not only one cons per level, and one must be careful and not forget the conses implied by the list concatenation operation. In my analysis of qsort, the number of conses verified

C(n) = p + C(p) + 1 + q + C(q) + p + 1

with n = p + q + 1, where the final p + 1 term represents the conses needed to concatenate the sorted sublists.

If the above function is analyzed the same way I did with qsort, the number of conses satisfies

C(n) = C(n - 1) + n - 1 + 1

which clearly implies C(n) = O ( n sup 2 ).

The key term is n + 1, which represents the conses associated to the concatenation, the same way p + 1 did in qsort's expression; had it not been there, the function would be O(n). So, while this example doesn't invalidate my proof, it serves as a good reminder of the cost of list concatenation and how easily it turns linear-time algorithms into O( n sup 2 ).

(Incidentally, it turns out that qsort would be O(n log n) even without the extra term caused by list concatenation, so the proof, albeit incorrect, would have reached the correct conclusion.)

Strict vs. non-strict, eager vs. lazy analysis

While it seems the original analysis was sound, a new question was raised indirectly when Edward Kmett wrote that

it's not the number of times you cons that matters, but the number of thunks that have to be forced to get to the answer

I've shown above that this is only true if you aren't careful when counting conses, but the problem it suggests deserves to be considered by itself: can the analysis under a strict semantics be taken as an upper bound of the complexity in the non-strict or lazy cases?

I always assumed that the answer was a straight "yes". Intuitively, under lazy evaluation you have thunks with the data necessary to evaluate their associated expressions only when needed, so surely at most as much work as if they were always evaluated is performed, right?

After pondering about this longly, I've found a case where this wouldn't be true, suggested directly by an alternative definition of non-strict semantics:

Non-strict semantics means that a function can have a definite value although its argument is undefined.

What if an algorithm depended on an undefined value for termination? In practice, this would mean that it relies on an expression raising an exception to stop the computation and return some result after recovery. Whereas in the strict case the expression would be evaluated unconditionally and the computation would terminate, with a non-strict semantics it would be left unevaluated if not needed, thus failing to raise the exception and finish in a timely manner.

Edit (2008-07-18): There's an example below for an impure language. Anyway impurity + laziness is a nonsensical combination and the example represents 2 different programs in one that happen to give the same answer, so I'll be happy to analyze pure programs and assume that "lazy evaluation always needs less reduction steps than eager evaluation" to obtain upper bounds (thanks apfelmus).

The above case is contrived, and such a program would be needlessly complex. In fact, in OCaml it would be expressed awkwardly with a ref, and in Haskell pattern matching over an Either value (implicit in bind) would normally force evaluation, thus behaving as the strict counterpart.

Barring this exception, it seems to me that analyzing under a strict semantics remains a good way to get an upper bound for the non-strict case, but lacking a formal proof (and even with one) you can never be too sure. Any pointers?

Something relevant might be found by following the trail given by this paper by P. Wadler, whose abstract states the following:

In a strict functional language, analysis of time complexity is straightforward, because of the following compositional rule: The time to evaluate (ƒ(g x)) equals the time to evaluate (g x) plus the time to evaluate (ƒ y), where y is the value of (g x). However, in a lazy language, this rule only gives an upper bound, possibly a crude one.

I haven't read the paper yet nor followed its references/citations to see if there's a proof for that. The problem I can foresee is that the counterexample I came up with might not happen in the simplified languages/computational models used in most papers.

At any rate, it seems that, for all but a few contrived programs (relying on "useless" expressions being evaluated to raise an exception and terminate), simple "eager" analysis remains a good way to obtain an upper bound for the non-strict case. If it is low enough, we can stop there and enjoy.

Superlinear GC work: why Ruby scales badly, and a possible explanation of qsort's behavior?


Read more...

2008-07-15 10:52 UTC Quicksort erratum

A few days ago, I called the following Haskell function, reminiscent of Quicksort and considered the epitome of beautiful code by many, unusable for taking "always O ( n sup 2 ) space and time":

   qsort []  = []
   qsort (x:xs) = qsort (filter (< x) xs) ++ [x] ++ qsort (filter (>= x) xs)

Moreover, I referred to this version (using a C-like DSEL in Haskell) due to Lennart Augustsson as a usable one (as opposed to a O ( n sup 2 ) algorithm):

   qsortM :: forall i a m r arr .
            (MonadRef m r, Num i, Ix i, MArray arr a m, Ord a Bool) =>
            arr i a -> m (arr i a)
   qsortM ma = runE $ do
       (lb, ub) <- embed $ getBounds ma
       let mlb = pure0 lb
           mub = pure0 ub
       a <- liftArray ma
   
       let qsort' l r =
               if1 (r > l) $ do
                   i <- auto l
                   j <- auto (r+1)
                   let v = a[l] :: E m a
                       iLTj = i < (j :: E m i)
                   while iLTj $ do
                       while ((i += 1) < mub && a[i] < v)
                           skip
                       while (a[j -= 1] > v)
                           skip
                       if1 iLTj $ do
                           a[i] =:= a[j]
                   a[l] =:= a[j]
                   qsort' l (j-1)
                   qsort' i r
        
       qsort' mlb mub
       return ma

I was wrong. Instead of analyzing it with some care, I believed the O ( n sup 2 ) claims I read long ago in some now forgotten blog. I just saw the list concatenations and reacted without thinking.

Finding an upper bound for the number of comparisons in the best case is easy. In fact, while I'am at it, I'll find a bound for the number of cons operations, which is clearly greater than that of comparisons.

Noting the number of conses C(n) for a list of size n, for n > 3 (there are a few irregularities for n = 1 to 3 because no cons is needed when appending or prepending [] to another list), C(n) = p + C(p) + 1 + q + C(q) + p + 1, where the problem for a list of length n has been divided into two halves (lists of length p and q) plus a lone pivot element. p (resp. q) conses are needed to build the list passed to the recursive call to qsort, C(p) (C(q)) conses are needed for the subproblem, and finally p + 1 conses are needed to concatenate the resulting sorted sublists.

The problem is how to find p and q. For the best case analysis, p approx q approx n over 2. The n list can be divided into the desired three parts as follows:


n mark = left floor n over 2 right floor + left ceiling n over 2 right ceiling 
= left floor n over 2 right floor + 1 + left ceiling n over 2 right ceiling - 1
= left floor n over 2 right floor + 1 + left floor { n + 1 } over 2 right floor - 1

n lineup = left floor n over 2 right floor + 1 + left floor { n - 1 } over 2 right floor

Therefore,

C(n) = C left ( left floor n over 2 right floor right ) + C left ( left floor { n - 1 } over 2 right floor right ) + 2 left floor n over 2 right floor + left floor { n - 1 } over 2 right floor + 2

(There are two choices for the number of conses in the final list concatenation, differing by 1. The above is a conservative one, which will give an upper bound.)

Even after referring to Graham & Knuth & Patashnik, I saw no immediate way to solve this recurrence (please drop a comment if you have a closed-form solution). Fortunately, I just need an upper bound so I can use

C(n) <= D(n) = 2 D left ( n over 2 right ) + 3 left ( n over 2 right ) + 2

which is easily proved by induction, since all I've done is to remove the floor operations.

Then, using the master theorem

D(n) = THETA left ( n sup { log sub 2 2 } log n right )

C(n) = O left ( n log n right )

(Note the use of O for C(n) and THETA for D(n).)

The above program thus uses at most O(n log n) space and time. The upper bound is satisfactory enough; in fact, it's the same as for the real quicksort, and, furthermore, the lowest achievable bound for a comparison sort.

So far, I've assumed that the program evaluates everything eagerly, which might not be the case in Haskell, as it has a non-strict semantics (which is not exactly the same as being lazy). Let's see if GHC is really smart and can recognize that qsort(l) is just a permutation of l to skip the sort altogether when some operation that doesn't depend on the order is performed (this is quite close to proving that the qsort function actually sorts the list, and would be an amazing exploit indeed).

To that end, I first rewrite the function in OCaml; if GHC is able to do less work thanks to its non-strict semantics, the solution will scale better than with OCaml:

let rec qsort = function
    [] -> []
  | x::xs -> qsort (List.filter ((>) x) xs) @ [x] @ qsort (List.filter ((<=) x) xs)

It's very similar to the Haskell one; the main difference is in the signs used with List.filter: ((>) x) means "x is greater than _", as opposed to "_ is greater than x" in Haskell.

The main function just creates a list of random integers, sums them up, and finally adds the numbers from the sorted list --- obtaining the same number, I hope (the sums are printed to stdout to make sure that computation is not omitted).

Here are the times I get:

times.png

The GHC binary takes nearly 800MB to sort a 3 million list, which explains why it suffers as n increases.

Well, this isn't... what was it, GHC 11? Being lazy does pay off if you try to get the nth (and in particular the first) element, though; using


Read more...
Last modified:2005/11/08 03:28:40
Keyword(s):
References:[Opening up my hiki wiki: bliki.rb plugin]