eigenclass logo
MAIN  Index  Search  Changes  PageRank  Login

Conclusions about Wide Finder, C++, OCaml, JoCaml, Erlang and friends

The Wide Finder project is coming to an end, as it quickly becomes a micro-optimization race.

Unsurprisingly, the JoCaml solution is much faster than those written in interpreted (or even JIT-compiled, as Erlang) languages. As I write this, the last JoCaml implementation Tim Bray tested is two times faster than the second entry, a 345-line, multi-process (not multi-thread) Erlang program with a specialized pattern matcher. I've made a comparable (i.e., without regexps) JoCaml version that is about three times faster than its Erlang counterpart (and still shorter).

I have to say that I've been pleasantly surprised by Erlang's performance, though; the HiPE system is doing a good job. Nevertheless, it still yields but one third of the raw speed JoCaml achieves.

Correction [2007-11-19] Erlang's HiPE is not a JIT compiler in Java's sense, see erlhackr's comment.

The Wide Finder implementations tested by Tim can be classified in four groups, according to the languages used (read "languages" as "language implementations"):

  1. interpreted languages (Gawk, Groovy, Perl, PHP, Ruby...)
  2. interpreted languages with built-in (i.e., fast, and usually implemented in C) functionality appropriate for the problem at hand; in this case, sublinear string search (Python)
  3. JIT-compiled languages (Erlang; Java is glaringly missing)
  4. compiled languages

The good thing about the two last categories is that they allow you to implement the missing functionality while achieving good speed. With an interpreted language, either you are lucky and have an efficient routine that does what you need, or you have to write an extension --- the two orders of magnitude paid up-front are hard to recover with algorithmic improvements. In other words, if you're using a fast language, you're not stuck with the data structures and algorithms integrated in the language; you can make your own if needed.

The OCaml and JoCaml Wide Finders are still the only compiled ones in Tim's list. Fortunately, Bob Miller recently wrote a "fairly well optimized, special-purpose C (okay, C++) program". The comparison is enlightening.

Bob's code weights 800 lines after removing unneeded PCRE code (vs. 150 + 150, for the Bigstring module, for the fastest and bulkiest JoCaml one), uses a specialized pattern matcher with "several assumptions about Tim Bray's example regex hardcoded in", employs a flyweight substring representation that precludes "chunked" mmaping (the whole file is mapped at once)... In few words, it's a fair amount of code that has gone through considerable micro-optimization.

When I timed it on a dual-core 2GHz Core 2 Duo box, it was about 11% faster than the JoCaml program. A 5-line change to the latter to use a specialized hash table, as done by the C++ program, reversed the relation, JoCaml being 1% faster than C++ now. This margin is smaller than the measurement error, and, a fortiori, the variance introduced by the platform, the environment, or the compiler.

Further micro-optimizations are possible both in the JoCaml and the C++ program (there are a few more opportunities in the JoCaml one, though). At any rate, it seems fair to say that the C++ and the JoCaml versions are about as fast. (A factor of 3 is half an order of magnitude; 10% is negligible.)

The main advantage of the JoCaml version is that, unlike the C++ one (based on threads), it only takes a one-line change to make it run in multiple machines.

Some conclusions

I've drawn some conclusions from Tim's results and my reading of several Wide Finders written in Python, Ruby, Erlang, C++ and, of course, OCaml and JoCaml:

Languages and implementations

Mentioning several programming languages is risky; tiresome verbal strifes follow often. These direct observations, however, shouldn't be controversial:

  • Erlang's HiPE system works fairly well
  • Python, Ruby and friends can perform decently if most of the time is spent in library functions written in C
  • the OCaml substrate of JoCaml can generate code about as fast as hand-tuned C++ in many (most?) cases*1
  • we cannot dismiss constant factors. JoCaml code written by an individual in an afternoon can perform better than Erlang code refined over the course of one month by Erlang experts*2 on a per-core basis, while scaling just as well.

A few personal reflections:

  • JoCaml allows to express and reason about concurrent and distributed systems at very high level, leaving little place for the sort of bugs that infest code written with threads and shared state. (I have written a couple fault-tolerant systems in JoCaml that would have been much harder with threads.)
  • Erlang's major strength is fault-tolerance, scalability, etc., not raw performance; promoting Erlang by saying that it will be "faster given enough cores" is disingenuous because it conceals the existence of alternatives like JoCaml that can scale just as well but perform much better per core.

About the Wide Finder problem

  • the Wide Finder problem is trivially parallelizable
  • IO performance, as measured by Bonnie, is irrelevant in this benchmark, since the file is in cache anyway. The Wide Finder figures tell us more about abstraction costs than about how quickly logs can be analyzed.
  • I would be much more interested in the results of the STREAM benchmark on the T5120. They would tell us how close we are to saturating the memory bandwidth, and ultimately how fast Wide Finder could run on a single machine (given enough cores).


No Title - Caoyuan (2007-11-12 (Mon) 07:25:18)

Just a little bit correct: Steve, Anders and I are all Erlang newbie (I wrote my first Erlang code half year ago). The lack of document about efficient binary in Erlang was the biggest problem, after that, all things went staightforward. The performance of Erlang is much better than what I thought it was.

Anders code could be much shorter after he write more Erlang code (for example, half a year later) :-)

The \d\d\dx/(\d\d\d\d/\d\d/\d\d/ part in regexp is not necessary per my testing, but if I'd like to add it, it won't affect the performance, and in 3 more lines in Anders' code:

 case Bin of
   <<_:S/binary, C1,C2,C3,$x,$/,Y1,Y2,Y3,Y4,$/,M1,M2,$/,D1,D2,$/,_/binary>> when
     C1 > 47, C1 < 58, C2 > 47, C2 < 58, C3 > 47, C3 < 58,
     Y1 > 47, Y1 < 58, Y2 > 47, Y2 < 58, Y3 > 47, Y3 < 58, Y4 > 47, Y4 < 58,
     M1 > 47, M1 < 58, M2 > 47, M2 < 58, D1 > 47, D2 < 58, D2 > 47, D2 < 58 ->

Of course, Erlang needs a good regexp module for binary.

For OCaml + JoCaml, the result is impressed and reasonable. I'd like to also see a Haskell solution.

Regards,

Caoyuan Deng

Anders 2007-11-12 (Mon) 09:11:46

Well, I wouldn't call myself a newbie. I wrote my first erlang program back in 96-97 I think. But I have only used erlang seriously for a few years.

Regarding the length and overall ugliness. It is the result of a number of incremental changes, where each change was done with a least effort approach, i.e. no serious refactoring has been done between the different versions.

Regarding the results in general.

1, I am surprised that erlang did as well as it did. Drag racing performance has never been a main goal for erlang.

2, Regarding the statement "...promoting Erlang by saying that it will be "faster given enough cores" is disingenuous..." I don't think I have ever seen that claim before. What we Erlangers normally say is that since a normal erlang system is already written with lots of concurrent processes, it is very easy to benefit from more cores without having to redesign the system.

mfp 2007-11-12 (Mon) 09:27:12

Just a little bit correct: Steve, Anders and I are all Erlang newbie (I wrote my first Erlang code half year ago).

Sorry for getting that wrong. Is it fair to characterize the Erlang solutions as a collective effort in part? I got that impression from several discussions in Erlang MLs.

That "regexp" code looks interesting :). Does Erlang have a way to extend the syntax (the way camlp4 does for OCaml) that would allow to write the above in a more readable way?

I agree that Erlang's performance is surprisingly good; I didn't expect it to be able to JIT-compile the B.-M. loop so well.

This Haskell implementation is the only one I've seen so far. It's around two times faster than a trivial Python implementation (which is IIRC 3X slower than wf-6.py, which is itself three times slower than the fastest JoCaml one). So all in all it's maybe 4 times slower than my JoCaml code, but I have no doubt that it can be improved until Haskell performs about as well as C++ or OCaml. I would be very surprised if Haskell couldn't deliver "at least 50% of the performance of a decent C(++) compiler".

mfp 2007-11-12 (Mon) 09:35:46

Anders >>

2, Regarding the statement "...promoting Erlang by saying that it will be "faster given enough cores" is disingenuous..." I don't think I have ever seen that claim before. What we Erlangers normally say is that since a normal erlang system is already written with lots of concurrent processes, it is very easy to benefit from more cores without having to redesign the system.

I don't believe any true Erlanger would say that, but I've heard similar things from people who have just learned it, the same way you got all those "Ruby on Rails makes you 20 times more productive" claims often coming from people who barely knew Rails, let alone Ruby.

I see we're all pleasantly surprised by Erlang's performance :-)

Caoyuan 2007-11-12 (Mon) 09:47:05

The Erlang solutions are collective effort, that's very good experience :-)

Haskell and OCaml are static typed, so the compiler can get much more optimal final code.

Erlang has parse_transform option, which can post-process the AST tree, so, I think I can write code as:

case Bin of
  <<_:S/binary, #,#,#,$x,$/,#,#,#,#,$/,#,#,$/,#,#,$/,_/binary>>  ->

where # means digital. The parse_transform can change the tokens to process it.

 

Caoyuan 2007-11-12 (Mon) 10:10:46

I just tested, followings should work.

case Bin of
 <<_:S/binary,'d','d','d',$x,$/,'d','d','d,'d',$/,'d','d',$/,'d','d',$/,_/binary>> ->

or

case Bin of
 <<_:S/binary,'#','#','#',$x,$/,'#','#','#,'#',$/,'#','#',$/,'#','#',$/,_/binary>> ->

Caoyuan 2007-11-12 (Mon) 10:14:55

or more direct: case Bin of

<<_:S/binary,d,d,d,$x,$/,d,d,d,d,$/,d,d,$/,d,d,$/,_/binary>> ->

Where d can be any atom, such as '#', s, 'S', '.' etc

caoyuan 2007-11-12 (Mon) 10:19:04

More:

<<_:S/binary,'^.'[^abc],'[abc]',$x,$/,d,d,d,d,$/,d,d,$/,d,d,$/,_/binary>> ->

Maybe I'll write a parse_transform for this :-)

Anders 2007-11-12 (Mon) 10:19:35

mfp>>

Is it fair to characterize the Erlang solutions as a collective effort in part?

Well, I suppose that You could call it that in some ways. But it was more a case of several persons working independently, but with the code published so often a good idea from one person would quickly get assimilated by others.

Personally I did some early work and then lost interest, until Steve published his tbray16, and then I started playing with that to see how far I could go.

Name:
Comment:
TextFormattingRules
Preview:

Erlang is not a "JIT-compiled language" - erlhackr (2007-11-13 (Tue) 09:42:10)

HiPE is not a JIT compiler, at least not in the Java sense of something that runs in the background, profiling, and compiling certain functions to native code. You can call HiPE manually during runtime, but the code won't be any different from using erlc +native. And unlike Java, Erlang bytecode will not be automatically translated to native code.

Now HiPE Tool is a JIT profiling extension to HiPE but AFAIK it isn't included in the Erlang distribution.

Name:
Comment:
TextFormattingRules
Preview:

Splitting file into chunks - eao197 (2007-11-14 (Wed) 06:53:24)

Could you explain how you split a log file into chunks. I'm not an OCaml programmer, so I suppose you devide file into blocks and then map each block into memory. If so how you process situation when some line goes into two blocks? For example: an original line: "GET /ongoing/When/123x/4567/89/01 ..." is splitted into "GET /ongoin" in the end of first block and "g/When/..." in the begin of second block.

mfp 2007-11-19 (Mon) 12:11:17

The code makes sure the file is split at newline boundaries by seeking to the desired offset, reading until the next newline character and recording the current position. This requires one seek per 50/100MB chunk, which isn't too onerous.

In the single-core implementation based on read(2), the incomplete line is copied to the start of the buffer and new data is appended to it.

eao197 2007-11-22 (Thr) 01:50:43

Thanks for your explanation!

Name:
Comment:
TextFormattingRules
Preview:

take the Perl on! - Alexy (2007-11-21 (Wed) 13:48:05)

They've beat you with ... Perl! Crank 'em lamdbas up! :)

mfp 2007-11-21 (Wed) 15:01:40

nah, Tim hasn't timed the fastest JoCaml versions yet. They are up to 50% faster than the one he's tried.

Perl's regexps are quite the thing, they seem to have sublinear search built-in.

Name:
Comment:
TextFormattingRules
Preview:

Use this form to create a new top-level comment; for direct replies to existing comments, use the text entries you'll find below.

Name:
Subject:

TextFormattingRules
Preview:
Last modified:2007/11/12 05:04:35
Keyword(s):[blog] [frontpage] [widefinder] [jocaml] [erlang] [ocaml]
References:

*1 Xavier Leroy's 'OCaml delivers at least 50% of the performance...' holds

*2 the Erlang solutions have been discussed to great lengths in several mailing lists; I wrote and benchmarked the OCaml/JoCaml ones in a few hours. Yet, it takes 3 cores for the best Erlang implementation to be as fast as the JoCaml one on a single one.