Comparing lightweight threads
The Computer Language Benchmarks Game includes a benchmark that measures context switch performance. The entries can be classified into three categories:
- those that rely on lightweight threads, namely Haskell GHC, Erlang HiPE and Mozart/Oz. I don't know whether Smalltalk VisualWorks and Scala belong here too; their performance is far from the top three entries. Update (2008-03-25) According to Julian Morrison and Calum, Smalltalk and Scala have lightweight threads and (lightweight) actors on top of Java thread pools, respectively (see comments below).
- those that use POSIX threads, with or without actual parallelism, including most other entries
- Ruby, a category of its own (?): very expensive user-mode threads with no parallelism
The OCaml entry is amongst the fastest pthread-based ones, but still markedly slower than the top entry, by around an order of magnitude. The version I wrote some time ago, based on the Lwt cooperative lightweight thread library, is close in performance to GHC. Some analysis reveals interesting facts about GHC's concurrency support and Lwt.
Performance
Here are the figures I get on a dual-core AMD Athlon 64 X2 6000+ in 32 bit mode (Why 32 and not 64? Because it's faster in this benchmark; 64 bit pointers are heavy and we get nothing in return in this case.)
Update: I've also timed the Mozart/Oz version and GCC C+pthreads as a baseline.
| implementation | memory usage | time (s) |
| Haskell GHC 6.8.2 | 2680KB | 1.22 |
| OCaml ocamlopt 1024Kword minor heap | 5178KB | 1.85 |
| Haskell GHC 6.8.2 -threaded, -N1 | 2760KB | 1.9 |
| OCaml ocamlopt 256Kword minor heap | 2016KB | 2.05 |
| OCaml ocamlopt 64Kword minor heap | 1228KB | 3.06 |
| Erlang R12B-1 HiPE | 5996KB | 3.96 |
| Mozart/Oz | 3788KB | 4.10 |
| OCaml ocamlopt 32Kword minor heap | 970KB | 4.24 |
| Haskell GHC 6.8.2 -threaded, -N2 | 3300KB | 15.27 |
| GCC C (POSIX threads) | 4520KB | 28.7 |
The Haskell code was compiled with -O2; I used erlc +native +"{hipe, [o3]}" for Erlang and -O3 -fomit-frame-pointer -march=athlong for GCC.
GC overhead
The OCaml version is clearly GC bound, and performance increases as the minor heap is enlarged, decreasing the amount of GC work. Whereas with the default 256KB heap the Erlang program is slightly faster, when OCaml is allowed to use comparable amounts of memory, it is over twice as fast.
No significant change in the speed of the GHC version was observed when I changed the size of the allocation area with the -A RTS option. Clearly, Lwt causes far more allocation than Haskell's MVars (probably because the latter is written in C--, more on that below). The GC statistics obtained with Gc.print_stat for OCaml and -sstderr for GHC confirm that intuition:
OCaml 1024Kword stack OCaml 32Kword stack minor_words: 369560098 minor_words: 369589896 promoted_words: 4075042 promoted_words: 130355359 major_words: 4075554 major_words: 130355871 minor_collections: 352 minor_collections: 11278 major_collections: 176 major_collections: 5639 [...] [...]
The profiler indicates that over 30% of the time is spent in GC and caml_modify (which is used in the write barrier). This contrasts with GHC:
80,753,700 bytes allocated in the heap
1,984,712 bytes copied during GC (scavenged)
5,120 bytes copied during GC (not scavenged)
634,880 bytes maximum residency (1 sample(s))
154 collections in generation 0 ( 0.00s)
1 collections in generation 1 ( 0.00s)
[...]
%GC time 0.3% (1.2% elapsed)
[...]
Even though the OCaml version allocates over 1400MB where GHC takes but 77MB, and scavenges 8 times more data, it is not far speed-wise.
If anybody knows how to obtain GC statistics for Erlang, that might yield some interesting figures.
Implementation
The solution based on Lwt naturally resembles the Haskell version. Threads communicate via synchronised mutable variables ("MVars") which act as mutable cells with two basic operations, put and take, with the following semantics:
- a thread that tries to take a value from an empty MVar blocks until another thread puts a value
- a thread that tries to put a value into an non-empty MVar blocks until it is emptied
For the sake of simplicity, I have relaxed the semantics of put so that it doesn't block when the MVar isn't empty (this means the previous value would be lost). I call the resulting abstraction an MVar' (the MVar semantics require 6 more lines of code; the resulting executable is around 10% slower) in the code below*1
module MVar' =
struct
open Lwt
type 'a t = {
mutable v : 'a option;
rd : 'a Lwt.t list ref
}
let create () = { v = None; rd = ref [] }
let make x = { v = Some x; rd = ref [] }
let awaken l x = match !l with
[] -> ()
| hd::tl -> l := tl; wakeup hd x
let add_wait l = let w = wait () in l := w :: !l; w
let take t = match t.v with
Some x -> t.v <- None; return x
| None -> add_wait t.rd
let put t x = match !(t.rd) with
hd::tl -> awaken t.rd x
| [] -> t.v <- Some x
end
(I've used a list of blocked readers instead of a queue for conciseness; it makes no difference in the threadring benchmark.)
The implementation is very simple. An MVar' holds a value v and a queue of blocked readers (in the MVar, there's another queue for blocked writers). The take operation checks whether the MVar' is empty. If it is (i.e., t.v is None) the current thread is added to the queue; otherwise (when t.v matches Some x), the cell is emptied and the value returned.
Whereas take requires three to four lines of OCaml code, GHC's implementation of that primitive operation (ghc6-6.8.2/rts/PrimOps.cmm) takes around 180 lines of C--. This holds for the rest of the primitives too: Lwt is written in pure OCaml and all the non-inlinable GHC primops are written in C-- (PrimOps.cmm takes around 2200 LoCs). This explains in part their relative performance (another reason is that ocamlopt doesn't avoid extra closures in monadic code the way GHC does).
Last year's "Lightweight concurrency primitives for GHC" paper introduces a new approach where most of the concurrency support is moved to Haskell code that relies on a small C/Cmm runtime system, leading to less error-prone and more maintainable code at the cost of a small performance drop. It seems GHC is moving in that direction, but as of GHC 6.8.2 the shift hasn't been completed (putMVar is still written in Cmm).
Benchmark code
Here's the threadring benchmark per se, built atop the above MVar' abstraction:
open Lwt let ring = 503 let rec thread i l r = perform m <-- MVar'.take l; let () = if m == 1 then print_int i; MVar'.put r (m-1) in if m > 0 then thread i l r else return () let () = let a = MVar'.make (int_of_string Sys.argv.(1)) in let z, ts = Array.fold_left (fun (l, ts) i -> let r = MVar'.create () in let t = thread i l r in (r, t :: ts)) (a, []) (Array.init (ring - 1) ((+) 2)) in ignore_result (choose (thread 1 z a :: ts)) let () = at_exit (fun () -> Gc.print_stat stdout)
It uses the pa_monad syntax extension, if only to get rid of one bind.
Improvements to GHC's thread scaling - dons (2008-03-24 (Mon) 11:54:54)
Interesting post!
I've one question, and one note.Firstly, does Lwt support mulitple processors? Secondly, you might be interested in this ticket, about GHC's forkIO threads not scaling linearly,
http://hackage.haskell.org/trac/ghc/ticket/1589
a patch to fix this should appear in the 6.10 release in a few months.
-- Don S.
mfp 2008-03-24 (Mon) 12:43:01
According to Simon Marlow, it is a GC problem, and he reports a GC time of 76.4% after the fix. In threadring, I'm getting only 0.3% GC time, so the bug doesn't seem to be triggered by a mere 503 threads (in the bug ticket, 10^5 to 10^6 threads were used).
Unlike GHC with -threaded, Lwt doesn't support multiple processors. In this example, however, using more than one processor doesn't help (-N2 makes it an order of magnitude slower). In OCaml, you'd have to use several processes, maybe coordinated with JoCaml's Join Calculus. It seems to me that lightweight threads are more about clean system models than about concurrency parallelism, so using different things for different purposes doesn't feel too onerous. Nonetheless, I can see the value in using the same abstraction for "lightweight concurrency" and "actual parallelism". This is what happens in the case of JoCaml for parallel and distributed programming: you don't care whether a channel is local or remote. So, in the concurrency/parallelism range you have:
| language | (lightweight) concurrency | parallelism | distributed programming |
| Haskell | Control.Concurrent abstractions | X? | |
| OCaml | Lwt/Thread | JoCaml | |
What should stand for the above X, i.e. what does one use in Haskell for distributed programming? Is there some implementation of message-passing concurrency or better yet a Haskell extension with support for distributed programming? I've only seen O'Haskell, which seems incomplete and unmaintained ("O'Hugs, the O'Haskell interpreter New release January 31 2001!").
mfp 2008-03-24 (Mon) 12:50:41
Haskellwiki also points to Glasgow Distributed Haskell, which seems a bit old: it needs the GHC 5.00 sources and the last activity report, from 2006, says it's an alpha-release.
comparison - rektide (2008-03-24 (Mon) 17:39:56)
Other lightweight thread benchmarks that would be fun to see:
- Mozart/Oz
- Protothreads
- Mono Continuations
- Stackless Python
Heres a totally random link to a comparison of Stackless Python and Erlang.
mfp 2008-03-25 (Tue) 11:38:10
I've added the Mozart/Oz times for reference, and implemented yet another version using protothreads; I'll write about it later.
No Title - Julian Morrison (2008-03-25 (Tue) 06:09:50)
Smalltalk and Scala use lightweight threads that are implemented in the language (rather than in the underlying VM). That puts them intermediate between OS-heavy and VM-lightweight in speed, but I expect they scale up to 1000s of threads like VM-lightweight.
Calum 2008-03-25 (Tue) 11:36:41
I think that the Scala actors implementation is backed by a Java threadpool which is shared across actors, so it is quite similar to the lighter-weight systems. Technically there won't be thousands of "threads" though, there'll be a large number of actors accessing a smaller number of threads.
mfp 2008-03-25 (Tue) 11:42:18
In the case of Smalltalk, are lightweight threads implemented on top of continuations? Lwt is also implemented in the language, yet it is much closer in performance to GHC's C-- concurrency support.
cwp 2008-03-25 (Tue) 11:57:33
Smalltalk VMs typically provide a few primitives for preempting running threads, working with semaphores etc, and the rest is implemented in Smalltalk with calls to the primitives.
*1 I use references instead of mutable fields for the write and reader queues because it simplifies the MVar code a bit.
Keyword(s):[blog] [frontpage] [lightweight] [thread] [concurrency] [ocaml] [haskell] [ghc] [erlang] [ruby]
References:[The lightest lightweight threads, Protothreads]