14 June 2010 at 11:52 ocamlmq, a small footprint, no-fuss STOMP broker for task queues and IPC

ocamlmq is a STOMP message broker with features that make it especially suitable for implementing task queues and communication between subsystems:

  • persistent queues, scaling over arbitrarily large numbers of queues with constant memory usage (i.e. supports millions of queues)

  • strong durability guarantees: messages are guaranteed to have been saved to disk by the time the sender gets a message receipt

  • message priorities

  • per-subscription prefetch limit for queue messages

  • error handling and ACK timeout: if a subscriber doesn't ACK a message after a (message-specific) timeout, the message will be sent to another subscriber. Messages are also resent automatically if a subscriber dies and its connection expires.

  • topic subscriptions (messages broadcast to all subscribers) with prefix matching

  • support for high numbers of subscriptions: millions of subscriptions pose no problem

  • simple extensions within the STOMP protocol to report the number of messages in a queue and the number of subscribers to a queue or topic

I wrote ocamlmq because I could not get ActiveMQ or RabbitMQ (+ its STOMP adapter) to work the way I wanted. The problems I ran into include:

  • excessive memory footprint (ActiveMQ and RabbitMQ), both at startup (e.g. >120MB for ActiveMQ vs. <3 MB for ocamlmq) and as new queues are created or new topic subscriptions are added

  • bad performance with ActiveMQ's scalable (to thousands of queues) storage backends (KahaDB, JDBC)

  • bad performance in RabbitMQ's topic message dispatch: RabbitMQ was doing a linear scan of the subscription table per dispatch

  • RabbitMQ did not guarantee that persistent messages had been saved to disk before sending the message receipt, which could lead to data loss (clarification 2010-06-15 22:34: messages would only be lost if RabbitMQ or the system crashed; see Jason's comment and my replies)

ocamlmq is written in OCaml. I originally estimated that a simple STOMP message broker with the feature set I wanted could be done in around 2000 lines. It took less than 1000, which later grew to ~1200 as I added new features (prefix topic subscriptions and control messages). Because ocamlmq is relatively small (the core of the broker takes around 500 lines of code), it is easy to extend and there's little space for bugs. The server is fairly efficient and abstracted over a storage backend; currently only PostgreSQL's is implemented (< 150 lines of code).

Limitations

ocamlmq works well in the intended use case (persistent queues and transient topic destinations, with possibly many queues and subscriptions), but it has some limitations which preclude its use in other domains:

  • ocamlmq is not designed to scale beyond several hundred / a few thousand simultaneous connections (it will work, but performance will be affected)

  • there is no flow control for topic messages (in the intended use case, topic messages are assumed to be relatively small and processed fast)

  • messages are limited to 16 MB on 32-bit platforms

  • the PostgreSQL storage backend can only persist a few thousand messages per second (note that ocamlmq allows >50K/s persistent message bursts in async mode)

  • ocamlmq does not support very high message rates (ocamlmq delivers only ~20000 ~40000 messages/second on a 3GHz AMD64 box)

If you need complex routing rules, scalability to many thousand simultaneous connections or other enterprise messaging features, you'll be better served by AMPQ or JMS brokers. ActiveMQ, in particular, is very configurable, so it'll most likely fit the bill if memory consumption and scaling to many subscriptions are not a concern.

Scalability

ocamlmq has been designed to support millions of queues and topic subscriptions without undue memory usage. This table summarizes the time complexity of some STOMP operations:

      SEND to queue           O(log subscribers)
      SEND to topic           O(subscribers + log (total subs))
      SUBSCRIBE to queue      O(log (total subs))
      SUBSCRIBE to topic      O(log subscribers)
      ACK                     O(1)

ocamlmq needs typically around 150 bytes per subscription, so 1 million subscriptions will not take much more than 150 MB of memory (compare to >120MB required by ActiveMQ on startup). No extra memory is needed for queues, so you can use lots of them with no concerns for memory usage.

Read more...

21 July 2009 at 07:49 Efficient low-level VMs implemented in high-level (functional) languages

The 2006 edition of the ICFP programming contest, one of the most enjoyable to date, introduced the Universal Machine used by a fictional society to program around 200 BC. Many people have tried their had at implementing the UM in a variety of languages. This table lists several C, C++ and Haskell implementations.

Even though it is clearly hard to beat C speed-wise here, high-level, functional programming languages like Haskell or OCaml can come quite close in spite of the very low-level nature of the problem. I'm getting 75% of C's performance (1m21s vs. 1m for the "SANDmark" benchmark on a 3GHz AMD64) in OCaml with straightforward code that performs array bound checking (i.e., unlike the other implementations, malicious machine images cannot take over the process). This makes it faster than the best performing C++ implementation on this table, and comparable to other C implementations; obviously, this says more about the C++ implementations than about the language itself, but it shows that they're all in the same league. (BTW, Haskell has improved a lot since GHC 6.5: the "ugly, fast" Fast.hs is only 2.25 times slower than edwardk.c with 6.8.2, and a bit worse with 6.10.3: 2m18s = 2.3X; um6.hs, less harmful to the eye, is 5.5X slower than edwardk.c with 6.10.3, though --- virtually the same as GHC 6.5. I had to change a few lines of code as some APIs have changed since.)

Here's the OCaml code:

Read more...

02 July 2009 at 13:22 Hash tables: separate chaining vs. double hashing

In my earlier finite map benchmarks which compared several hash tables and functional maps (AVL trees and ternary trees), separate chaining (with $$\alpha = 0.47$$) beat double hashing ($$\alpha = 0.42$$) at unsuccessful searches (1% hit rate), but was slower at successful ones.

Theory predicts the following average number of probes for unsuccessful and successful lookups (expressions below): Unsuccessful searches, same load factor Successful searches, same load factor

Separate chaining looks much better here, but the graphs are misleading, because we don't actually care as much about the load factor as about memory usage, so we want to compare the collision resolution schemes when the latter is the same.

If we use a linked list for the collision chain, this represents one word of overhead per item compared to double hashing. For instance, if the load factor with separate chaining is 1, we can afford to have a table twice as big with double hashing for the same amount of memory and a load factor of 50%. In other words, $$N + n = N'$$ where $$N$$ and $$N'$$ are the sizes of the tables for separate chaining and double hashing, and $$n$$ the number of elements. The respective load factors are

\[\begin{eqnarray*} \alpha & = & \frac{n}{N}\\ \alpha' & = & \frac{n}{N'}\\ & = & \frac{n}{N+n}\\ \alpha' & = & \frac{\alpha}{1+\alpha}\end{eqnarray*} \]

The expressions for the average number of probes for unsuccessful and successful searches are (Knuth, TAOCP Vol 3, Section 6.4), for separate chaining

Read more...

02 July 2009 at 10:21 Math typesetting with jsMath

I've added math typesetting support to Ocsiblog, which runs this site, using jsMath: (double-click on the equations to see the sources)

\[\begin{eqnarray*} \nabla.D & = & \rho\\ \nabla.B & = & 0\\ \nabla\times E & = & -\frac{\partial B}{\partial t}\\ \nabla\times H & = & j+\frac{\partial D}{\partial t}\end{eqnarray*} \]

When jsMath is enabled, you'll see in the bottom right-hand corner a button that triggers a control panel allowing you to set several options such as the scale or the rendering mechanism (TeX fonts, image fonts or native Unicode fonts; you can pick the one that works best for you and save the configuration --- the best one is TeX fonts, if you have them, but Unicode fonts is quite decent). For best viewing, you can get the TeX fonts here (the page includes detailed install instructions for Windows, OSX and Un*x).

Here's what the above equations should look like with TeX and native Unicode fonts: TeX Unicode

19 June 2009 at 10:16 Finite maps galore: imperative code strikes back

Matías Giovannini has been implementing purely functional data structures based on Bentley and Sedgewick's ternary search trees (TST) and pitching them against OCaml's imperative hash table (Hashtbl). In the interval between the initial results and his showing his first version (called Trie_map from now on), I made a functional TST ("Ternary") which compares favorably to both Trie_map and the revised implementation ("Trie_map_mod").

I also resuscitated a hash table implementation of mine that uses open addressing with double hashing (in contrast to Hashtbl's external chaining) and exhibits vastly improved performance.

I'll show the benchmark results first, to be followed by some code and analysis.

Benchmarks

I benchmark the following finite map implementations:

  • imperative:

    • Fasthashtbl: hash table with open addressing and double hashing

    • Hashtbl: the hash table from INRIA's stdlib (external chaining)

    • Hashtbl_mod: Hashtbl with more aggressive resizing (lower load factor)

    • Hashtbl_hval: Hashtbl_mod with caching of the hash value

  • functional:

    • Trie_map: TST with coalesced constructor for nodes with and without values

    • Trie_map_mod: TST with different constructors for leaves and inner nodes with/without a value

    • Ternary: TST with separate constructor for nodes with and without values but no leaf constructor

    • Map: the Map implementation from INRIA's stdlib (AVL tree)

All the finite maps are compared by

  1. building a map with 98568 English words, which are added in (a) randomized and (b) lexicographic order

  2. measuring the minimal lookup cost with constant lookups (so that everything is in the L1 cache)

  3. timing successful lookups by looking for the 98568 words

    1. in the same (random) order the elements were added

    2. in randomized order

    3. in lexicographic order

    4. in reverse lexicographic order

    ... against the maps built in randomized and lexicographic order

  4. timing searches against the previously generated maps (randomized and lexicographic order) using

    1. a larger set of English words (217625, 45.3% hit rate)

    2. a set of Spanish words (86016 words, 1.3% hit rate)

    ... in randomized and lexicographic order

For the sake of exhaustiveness, I measure the iteration and functor overhead and detract it from the lookup timings. The total size of the structures (including the size of the strings) is also measured. All the measurements are repeated 20 times, and the best time is retained. The benchmarks are performed in a single program execution to achieve steady state (major heap resized to required size). The set of words used in the map is large enough to ensure the structure (taking 5.7MB to over 12MB) doesn't fit in the 2MB L2 cache.

The full results can be found here (also includes hash table benchmark with integer keys).

This table shows the results for a map built in randomized order (refer to the full output for those in lexicographic order). Speed is measured in ops/sec, and size in bytes ("Find (hit, orig)" means that the lookups are performed in the same random order as the insertions).

Insertion (randomized order) and lookup

Data structure Size Load factor Insertion (rand) Find (constant) Find (hit, rand) Find (hit, orig) Find (hit, sorted) Find (45.3%, rand) Find (1.3%, rand) Find (1.3% sorted)
FastHashtbl 7325544 0.376 1509558 16701222 2669067 3554728 3958575 2347248 3243197 5133702
Hashtbl 5752664 1105131 5617227 2136562 2296294 2791035 1623434 1731465 2229713
Hashtbl_mod 7325528 0.376 639994 7373058 2711193 2963744 3893034 2249186 2926475 4403349
Hashtbl_hval 8114072 0.376 675892 10457458 2769387 3056772 4153404 2435018 3436621 5542555
Ternary 11894440 179599 14739138 1484213 1556592 4576735 1567019 2419841 6466320
Trie_map 99834 11870073 985476 1024257 3022102 1085412 1797166 5763597
Trie_map_mod 12417768 164725 8476101 1284442 1340733 3246878 1417483 2151898 4811496
Map 6805440 167046 1168963 674209 671368 1026645 556788 633682 986727

Implementation notes

The source code for all the containers and the benchmarks can be obtained here.

I modified Hashtbl first by changing its resizing policy so that the load factor is the same as in Fasthashtbl (Hashtbl_mod), and second by caching the hash value in the nodes (Hashtbl_hval), which makes successful lookups faster when there are collisions (since the key comparison can be bypassed when the hash values differ) and helps a lot with unsuccessful ones.

Fasthashtbl, Hashtbl_mod and Hashtbl_hval only allow the load factors to grow to 50% before resizing the underlying array to twice its former size; the average load factor is thus 37.5%. Hashtbl, on the other hand, allows it to grow to 2.0, which, theory predicts, would require 13.4 probes per unsuccessful search, and 4.6 per successful one (Knuth, TAOCP Vol 3, Section 6.4, page 524). For a 37.5% load factor, Hashtbl, Hashtbl_mod and Hashtbl_hval will require 1.1 and 1.2, and Fasthashtbl (which uses open addressing with double hashing) will need 1.6 and 1.25 --- double hashing is about as good for successful lookups but worse at unsuccessful ones.

Trie_map_mod and Ternary are very similar: the only differences are that the latter doesn't have a separate Leaf node type, and its code has undergone a few manual optimizations explained below.

Some observations

Functional vs. imperative

Read more...