<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>
eigenclass
</title>
<link>
http://eigenclass.org
</link>
<atom:link href="http://eigenclass.org/R2/feeds/rss2/all" rel="self" type="application/rss+xml" />
<description>
eigenclass
</description>
<ttl>
180
</ttl>
<item>
<title>
Efficient low-level VMs implemented in high-level (functional) languages
</title>
<link>
http://eigenclass.org/R2/writings/efficient-vms-in-hlls
</link>
<description>
&lt;p&gt;The &lt;a href=&quot;http://www.boundvariable.org/&quot;&gt;2006 edition of the ICFP programming contest&lt;/a&gt;, one of the most enjoyable to date, introduced the &lt;a href=&quot;http://www.boundvariable.org/um-spec.txt&quot;&gt;Universal Machine&lt;/a&gt; used by a fictional society to program around 200 BC. Many people have tried their had at implementing the UM in a variety of languages. &lt;a href=&quot;http://www.cse.unsw.edu.au/~dons/um.html&quot;&gt;This table&lt;/a&gt; lists several C, C++ and Haskell implementations.&lt;/p&gt;&lt;p&gt;Even though it is clearly hard to beat C speed-wise here, high-level, functional programming languages like Haskell or OCaml can come quite close in spite of the very low-level nature of the problem. I'm getting 75% of C's performance (1m21s vs.  1m for the &amp;quot;SANDmark&amp;quot; benchmark on a 3GHz AMD64) in OCaml with straightforward code &lt;em&gt;that performs array bound checking&lt;/em&gt; (i.e., unlike the other implementations, malicious machine images cannot take over the process).  This makes it faster than the best performing C++ implementation &lt;a href=&quot;http://www.cse.unsw.edu.au/~dons/um.html&quot;&gt;on this table&lt;/a&gt;, and comparable to other C implementations; obviously, this says more about the C++ implementations than about the language itself, but it shows that they're all in the same league.  (BTW, Haskell has improved a lot since GHC 6.5: the &amp;quot;ugly, fast&amp;quot; &lt;a href=&quot;http://www.cse.unsw.edu.au/~dons/tmp/Fast.hs&quot;&gt;Fast.hs&lt;/a&gt; is only 2.25 times slower than edwardk.c with 6.8.2, and a bit worse with 6.10.3: 2m18s = 2.3X; &lt;a href=&quot;http://www.cse.unsw.edu.au/~dons/tmp/um6.hs&quot;&gt;um6.hs&lt;/a&gt;, less harmful to the eye, is 5.5X slower than edwardk.c with 6.10.3, though --- virtually the same as GHC 6.5. I had to change a few lines of code as some APIs have changed since.)&lt;/p&gt;&lt;p&gt;Here's the OCaml code:&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;http://eigenclass.org/R2/writings/efficient-vms-in-hlls&quot;&gt;Read more...&lt;/a&gt;&lt;/p&gt;
</description>
<pubDate>
Tue, 21 Jul 2009 07:49:32 +0000
</pubDate>
<guid isPermaLink="true">
http://eigenclass.org/R2/writings/efficient-vms-in-hlls
</guid>
</item>
<item>
<title>
Hash tables: separate chaining vs. double hashing
</title>
<link>
http://eigenclass.org/R2/writings/separate-chaining-vs-double-hashing
</link>
<description>
&lt;div&gt;&lt;b&gt;This entry uses &lt;a href=&quot;http://www.math.union.edu/locate/jsMath&quot;&gt;jsMath&lt;/a&gt; to display mathematical expressions. Please go to the main site
                   if some fail to render.&lt;/b&gt;&lt;hr&gt;&lt;/div&gt;&lt;p&gt;In my earlier &lt;a href=&quot;http://eigenclass.org/R2/writings/finite-map-benchmarks&quot;&gt;finite map benchmarks&lt;/a&gt; which compared several hash tables and functional maps (AVL trees and ternary trees), separate chaining (with $$\alpha = 0.47$$) beat double hashing ($$\alpha = 0.42$$) at unsuccessful searches (1% hit rate), but was slower at successful ones.&lt;/p&gt;&lt;p&gt;Theory predicts the following average number of probes for unsuccessful and successful lookups (expressions below): &lt;img src=&quot;http://eigenclass.org/R2/files/separate-chaining-vs-double-hashing/unsuccessful1.png&quot; alt=&quot;Unsuccessful searches, same load factor&quot;&gt; &lt;img src=&quot;http://eigenclass.org/R2/files/separate-chaining-vs-double-hashing/successful1.png&quot; alt=&quot;Successful searches, same load factor&quot;&gt;&lt;/p&gt;&lt;p&gt;Separate chaining looks &lt;em&gt;much&lt;/em&gt; better here, but the graphs are misleading, because we don't actually care as much about the load factor as about memory usage, so we want to compare the collision resolution schemes when the latter is the same.&lt;/p&gt;&lt;p&gt;If we use a linked list for the collision chain, this represents one word of overhead per item compared to double hashing. For instance, if the load factor with separate chaining is 1, we can afford to have a table twice as big with double hashing for the same amount of memory and a load factor of 50%. In other words, $$N + n = N'$$ where $$N$$ and $$N'$$ are the sizes of the tables for separate chaining and double hashing, and $$n$$ the number of elements. The respective load factors are&lt;/p&gt;&lt;div&gt;&lt;code&gt;\[\begin{eqnarray*}
\alpha &amp;amp; = &amp;amp; \frac{n}{N}\\
\alpha' &amp;amp; = &amp;amp; \frac{n}{N'}\\
 &amp;amp; = &amp;amp; \frac{n}{N+n}\\
\alpha' &amp;amp; = &amp;amp; \frac{\alpha}{1+\alpha}\end{eqnarray*}
\]&lt;/code&gt;&lt;/div&gt;&lt;p&gt;The expressions for the average number of probes for unsuccessful and successful searches are (Knuth, TAOCP Vol 3, Section 6.4), for separate chaining&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;http://eigenclass.org/R2/writings/separate-chaining-vs-double-hashing&quot;&gt;Read more...&lt;/a&gt;&lt;/p&gt;
</description>
<pubDate>
Thu, 02 Jul 2009 13:22:19 +0000
</pubDate>
<guid isPermaLink="true">
http://eigenclass.org/R2/writings/separate-chaining-vs-double-hashing
</guid>
</item>
<item>
<title>
Math typesetting with jsMath
</title>
<link>
http://eigenclass.org/R2/writings/ocsiblog-jsmath-support
</link>
<description>
&lt;div&gt;&lt;b&gt;This entry uses &lt;a href=&quot;http://www.math.union.edu/locate/jsMath&quot;&gt;jsMath&lt;/a&gt; to display mathematical expressions. Please go to the main site
                   if some fail to render.&lt;/b&gt;&lt;hr&gt;&lt;/div&gt;&lt;p&gt;I've added math typesetting support to Ocsiblog, which runs this site, using &lt;a href=&quot;http://www.math.union.edu/~dpvc/jsMath/welcome.html&quot;&gt;jsMath&lt;/a&gt;: (double-click on the equations to see the sources)&lt;/p&gt;&lt;div&gt;&lt;code&gt;\[\begin{eqnarray*}
\nabla.D &amp;amp; = &amp;amp; \rho\\
\nabla.B &amp;amp; = &amp;amp; 0\\
\nabla\times E &amp;amp; = &amp;amp; -\frac{\partial B}{\partial t}\\
\nabla\times H &amp;amp; = &amp;amp; j+\frac{\partial D}{\partial t}\end{eqnarray*}
\]&lt;/code&gt;&lt;/div&gt;&lt;p&gt;When jsMath is enabled, you'll see in the bottom right-hand corner a button that triggers a control panel allowing you to set several options such as the scale or the rendering mechanism (TeX fonts, image fonts or native Unicode fonts; you can pick the one that works best for you and save the configuration --- the best one is TeX fonts, if you have them, but Unicode fonts is quite decent).  For best viewing, you can &lt;a href=&quot;http://www.math.union.edu/~dpvc/jsMath/users/fonts.html&quot;&gt;get the TeX fonts here&lt;/a&gt; (the page includes detailed install instructions for Windows, OSX and Un*x).&lt;/p&gt;&lt;p&gt;Here's what the above equations should look like with TeX and native Unicode fonts: &lt;img src=&quot;http://eigenclass.org/R2/files/ocsiblog-jsmath-support/texfonts.png&quot; alt=&quot;TeX&quot;&gt; &lt;img src=&quot;http://eigenclass.org/R2/files/ocsiblog-jsmath-support/unicode.png&quot; alt=&quot;Unicode&quot;&gt;&lt;/p&gt;
</description>
<pubDate>
Thu, 02 Jul 2009 10:21:46 +0000
</pubDate>
<guid isPermaLink="true">
http://eigenclass.org/R2/writings/ocsiblog-jsmath-support
</guid>
</item>
<item>
<title>
Finite maps galore: imperative code strikes back
</title>
<link>
http://eigenclass.org/R2/writings/finite-map-benchmarks
</link>
<description>
&lt;div&gt;&lt;b&gt;This entry uses &lt;a href=&quot;http://www.math.union.edu/locate/jsMath&quot;&gt;jsMath&lt;/a&gt; to display mathematical expressions. Please go to the main site
                   if some fail to render.&lt;/b&gt;&lt;hr&gt;&lt;/div&gt;&lt;p&gt;Matías Giovannini has been implementing purely functional data structures based on &lt;a href=&quot;http://www.cs.princeton.edu/~rs/strings/&quot;&gt;Bentley and Sedgewick's ternary search trees (TST)&lt;/a&gt; and pitching them against OCaml's imperative hash table (Hashtbl). In the interval between the &lt;a href=&quot;http://alaska-kamtchatka.blogspot.com/2009/06/algorithm-conscious-cache-oblivious.html&quot;&gt;initial results&lt;/a&gt; and his showing his &lt;a href=&quot;http://alaska-kamtchatka.blogspot.com/2009/06/simple-efficient-trie-maps.html&quot;&gt;first version&lt;/a&gt; (called Trie_map from now on), I made a functional TST (&amp;quot;Ternary&amp;quot;) which compares favorably to both Trie_map and the &lt;a href=&quot;http://alaska-kamtchatka.blogspot.com/2009/06/tree-nursery.html&quot;&gt;revised implementation&lt;/a&gt; (&amp;quot;Trie_map_mod&amp;quot;).&lt;/p&gt;&lt;p&gt;I also resuscitated a hash table implementation of mine that uses open addressing with double hashing (in contrast to Hashtbl's external chaining) and exhibits vastly improved performance.&lt;/p&gt;&lt;p&gt;I'll show the benchmark results first, to be followed by some code and analysis.&lt;/p&gt;&lt;h2&gt; Benchmarks&lt;/h2&gt;&lt;p&gt;I benchmark the following finite map implementations:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;imperative:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Fasthashtbl: hash table with open addressing and double hashing&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Hashtbl: the hash table from INRIA's stdlib (external chaining)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Hashtbl_mod: Hashtbl with more aggressive resizing (lower load factor)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Hashtbl_hval: Hashtbl_mod with caching of the hash value&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;functional:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Trie_map: TST with coalesced constructor for nodes with and without values&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Trie_map_mod: TST with different constructors for leaves and inner nodes with/without a value&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Ternary: TST with separate constructor for nodes with and without values but no leaf constructor&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Map: the Map implementation from INRIA's stdlib (AVL tree)&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;All the finite maps are compared by&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;building a map with 98568 English words, which are added in (a) randomized and (b) lexicographic order&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;measuring the minimal lookup cost with constant lookups (so that everything is in the L1 cache)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;timing successful lookups by looking for the 98568 words&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;in the same (random) order the elements were added&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;in randomized order&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;in lexicographic order&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;in reverse lexicographic order&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;... against the maps built in randomized and lexicographic order&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;timing searches against the previously generated maps (randomized and lexicographic order) using&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;a larger set of English words (217625, 45.3% hit rate)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;a set of Spanish words (86016 words, 1.3% hit rate)&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;... in randomized and lexicographic order&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;For the sake of exhaustiveness, I measure the iteration and functor overhead and detract it from the lookup timings. The total size of the structures (including the size of the strings) is also measured. All the measurements are repeated 20 times, and the best time is retained. The benchmarks are performed in a single program execution to achieve steady state (major heap resized to required size). The set of words used in the map is large enough to ensure the structure (taking 5.7MB to over 12MB) doesn't fit in the 2MB L2 cache.&lt;/p&gt;&lt;p&gt;The &lt;a href=&quot;http://eigenclass.org/static/finite-map-bm.txt&quot;&gt;full results can be found here&lt;/a&gt; (also includes hash table benchmark with integer keys).&lt;/p&gt;&lt;p&gt;This table shows the results for a map built in randomized order (refer to the full output for those in lexicographic order).  Speed is measured in ops/sec, and size in bytes (&amp;quot;Find (hit, orig)&amp;quot; means that the lookups are performed in the same random order as the insertions).&lt;/p&gt;&lt;h3&gt; Insertion (randomized order) and lookup&lt;/h3&gt;&lt;table&gt;
&lt;tbody&gt;

&lt;tr&gt;
&lt;th&gt;Data structure&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Load factor&lt;/th&gt;
&lt;th&gt;Insertion (rand)&lt;/th&gt;
&lt;th&gt;Find (constant)&lt;/th&gt;
&lt;th&gt;Find (hit, rand)&lt;/th&gt;
&lt;th&gt;Find (hit, orig)&lt;/th&gt;
&lt;th&gt;Find (hit, sorted)&lt;/th&gt;
&lt;th&gt;Find (45.3%, rand)&lt;/th&gt;
&lt;th&gt;Find (1.3%, rand)&lt;/th&gt;
&lt;th&gt;Find (1.3% sorted)&lt;/th&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;th&gt;FastHashtbl&lt;/th&gt;
&lt;td&gt;7325544&lt;/td&gt;
&lt;td&gt;0.376&lt;/td&gt;
&lt;td&gt;1509558&lt;/td&gt;
&lt;td&gt;16701222&lt;/td&gt;
&lt;td&gt;2669067&lt;/td&gt;
&lt;td&gt;3554728&lt;/td&gt;
&lt;td&gt;3958575&lt;/td&gt;
&lt;td&gt;2347248&lt;/td&gt;
&lt;td&gt;3243197&lt;/td&gt;
&lt;td&gt;5133702&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;th&gt;Hashtbl&lt;/th&gt;
&lt;td&gt;5752664&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;1105131&lt;/td&gt;
&lt;td&gt;5617227&lt;/td&gt;
&lt;td&gt;2136562&lt;/td&gt;
&lt;td&gt;2296294&lt;/td&gt;
&lt;td&gt;2791035&lt;/td&gt;
&lt;td&gt;1623434&lt;/td&gt;
&lt;td&gt;1731465&lt;/td&gt;
&lt;td&gt;2229713&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;th&gt;Hashtbl_mod&lt;/th&gt;
&lt;td&gt;7325528&lt;/td&gt;
&lt;td&gt;0.376&lt;/td&gt;
&lt;td&gt;639994&lt;/td&gt;
&lt;td&gt;7373058&lt;/td&gt;
&lt;td&gt;2711193&lt;/td&gt;
&lt;td&gt;2963744&lt;/td&gt;
&lt;td&gt;3893034&lt;/td&gt;
&lt;td&gt;2249186&lt;/td&gt;
&lt;td&gt;2926475&lt;/td&gt;
&lt;td&gt;4403349&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;th&gt;Hashtbl_hval&lt;/th&gt; &lt;!-- ds --&gt;
&lt;td&gt;8114072&lt;/td&gt; &lt;!-- size --&gt;
&lt;td&gt;0.376&lt;/td&gt; &lt;!-- alpha --&gt;
&lt;td&gt;675892&lt;/td&gt; &lt;!-- insert rand --&gt;
&lt;td&gt;10457458&lt;/td&gt; &lt;!-- constant --&gt;
&lt;td&gt;2769387&lt;/td&gt; &lt;!-- hit --&gt;
&lt;td&gt;3056772&lt;/td&gt; &lt;!-- hit, same --&gt;
&lt;td&gt;4153404&lt;/td&gt; &lt;!-- hit, sorted --&gt;
&lt;td&gt;2435018&lt;/td&gt; &lt;!-- 45.3 --&gt;
&lt;td&gt;3436621&lt;/td&gt; &lt;!-- 1.3 rand --&gt;
&lt;td&gt;5542555&lt;/td&gt; &lt;!-- 1.3 sorted --&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;th&gt;Ternary&lt;/th&gt; &lt;!-- ds --&gt;
&lt;td&gt;11894440&lt;/td&gt; &lt;!-- size --&gt;
&lt;td&gt;&lt;/td&gt; &lt;!-- alpha --&gt;
&lt;td&gt;179599&lt;/td&gt; &lt;!-- insert rand --&gt;
&lt;td&gt;14739138&lt;/td&gt; &lt;!-- constant --&gt;
&lt;td&gt;1484213&lt;/td&gt; &lt;!-- hit --&gt;
&lt;td&gt;1556592&lt;/td&gt; &lt;!-- hit, same --&gt;
&lt;td&gt;4576735&lt;/td&gt; &lt;!-- hit, sorted --&gt;
&lt;td&gt;1567019&lt;/td&gt; &lt;!-- 45.3 --&gt;
&lt;td&gt;2419841&lt;/td&gt; &lt;!-- 1.3 rand --&gt;
&lt;td&gt;6466320&lt;/td&gt; &lt;!-- 1.3 sorted --&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;th&gt;Trie_map&lt;/th&gt; &lt;!-- ds --&gt;
&lt;td&gt;&lt;/td&gt; &lt;!-- size --&gt;
&lt;td&gt;&lt;/td&gt; &lt;!-- alpha --&gt;
&lt;td&gt;99834&lt;/td&gt; &lt;!-- insert rand --&gt;
&lt;td&gt;11870073&lt;/td&gt; &lt;!-- constant --&gt;
&lt;td&gt;985476&lt;/td&gt; &lt;!-- hit --&gt;
&lt;td&gt;1024257&lt;/td&gt; &lt;!-- hit, same --&gt;
&lt;td&gt;3022102&lt;/td&gt; &lt;!-- hit, sorted --&gt;
&lt;td&gt;1085412&lt;/td&gt; &lt;!-- 45.3 --&gt;
&lt;td&gt;1797166&lt;/td&gt; &lt;!-- 1.3 rand --&gt;
&lt;td&gt;5763597&lt;/td&gt; &lt;!-- 1.3 sorted --&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;th&gt;Trie_map_mod&lt;/th&gt; &lt;!-- ds --&gt;
&lt;td&gt;12417768&lt;/td&gt; &lt;!-- size --&gt;
&lt;td&gt;&lt;/td&gt; &lt;!-- alpha --&gt;
&lt;td&gt;164725&lt;/td&gt; &lt;!-- insert rand --&gt;
&lt;td&gt;8476101&lt;/td&gt; &lt;!-- constant --&gt;
&lt;td&gt;1284442&lt;/td&gt; &lt;!-- hit --&gt;
&lt;td&gt;1340733&lt;/td&gt; &lt;!-- hit, same --&gt;
&lt;td&gt;3246878&lt;/td&gt; &lt;!-- hit, sorted --&gt;
&lt;td&gt;1417483&lt;/td&gt; &lt;!-- 45.3 --&gt;
&lt;td&gt;2151898&lt;/td&gt; &lt;!-- 1.3 rand --&gt;
&lt;td&gt;4811496&lt;/td&gt; &lt;!-- 1.3 sorted --&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;th&gt;Map&lt;/th&gt; &lt;!-- ds --&gt;
&lt;td&gt;6805440&lt;/td&gt; &lt;!-- size --&gt;
&lt;td&gt;&lt;/td&gt; &lt;!-- alpha --&gt;
&lt;td&gt;167046&lt;/td&gt; &lt;!-- insert rand --&gt;
&lt;td&gt;1168963&lt;/td&gt; &lt;!-- constant --&gt;
&lt;td&gt;674209&lt;/td&gt; &lt;!-- hit --&gt;
&lt;td&gt;671368&lt;/td&gt; &lt;!-- hit, same --&gt;
&lt;td&gt;1026645&lt;/td&gt; &lt;!-- hit, sorted --&gt;
&lt;td&gt;556788&lt;/td&gt; &lt;!-- 45.3 --&gt;
&lt;td&gt;633682&lt;/td&gt; &lt;!-- 1.3 rand --&gt;
&lt;td&gt;986727&lt;/td&gt; &lt;!-- 1.3 sorted --&gt;
&lt;/tr&gt;

&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt; Implementation notes&lt;/h2&gt;&lt;p&gt;The &lt;a href=&quot;http://github.com/mfp/ocaml-finite-maps/tree/master&quot;&gt;source code for all the containers and the benchmarks can be obtained here&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;I modified Hashtbl first by changing its resizing policy so that the load factor is the same as in Fasthashtbl (Hashtbl_mod), and second by caching the hash value in the nodes (Hashtbl_hval), which makes successful lookups faster when there are collisions (since the key comparison can be bypassed when the hash values differ) and helps a lot with unsuccessful ones.&lt;/p&gt;&lt;p&gt;Fasthashtbl, Hashtbl_mod and Hashtbl_hval only allow the load factors to grow to 50% before resizing the underlying array to twice its former size; the average load factor is thus 37.5%. Hashtbl, on the other hand, allows it to grow to 2.0, which, theory predicts, would require 13.4 probes per unsuccessful search, and 4.6 per successful one (Knuth, TAOCP Vol 3, Section 6.4, page 524). For a 37.5% load factor, Hashtbl, Hashtbl_mod and Hashtbl_hval will require 1.1 and 1.2, and Fasthashtbl (which uses open addressing with double hashing) will need 1.6 and 1.25 --- double hashing is about as good for successful lookups but worse at unsuccessful ones.&lt;/p&gt;&lt;p&gt;Trie_map_mod and Ternary are very similar: the only differences are that the latter doesn't have a separate Leaf node type, and its code has undergone a few manual optimizations explained below.&lt;/p&gt;&lt;h2&gt; Some observations&lt;/h2&gt;&lt;h3&gt; Functional vs. imperative&lt;/h3&gt;&lt;p&gt;&lt;a href=&quot;http://eigenclass.org/R2/writings/finite-map-benchmarks&quot;&gt;Read more...&lt;/a&gt;&lt;/p&gt;
</description>
<pubDate>
Fri, 19 Jun 2009 10:16:51 +0000
</pubDate>
<guid isPermaLink="true">
http://eigenclass.org/R2/writings/finite-map-benchmarks
</guid>
</item>
<item>
<title>
Making mutable state faster: optimizing the caml_modify write barrier
</title>
<link>
http://eigenclass.org/R2/writings/optimizing-caml_modify
</link>
<description>
&lt;p&gt;As &lt;a href=&quot;http://eigenclass.org/R2/writings/write-barrier-cost&quot;&gt;I explained yesterday&lt;/a&gt;, the &lt;code&gt;caml_modify&lt;/code&gt; write barrier needed by OCaml's incremental + generational GC is the main reason why immutable data can often be faster than mutable state. I've played a little with it and achieved a &amp;gt;50% speed boost in synthetic benchmarks, which could map to up to two-digit percentage gains in actual programs (in particular, it can make &lt;code&gt;Array.map&lt;/code&gt;, &lt;code&gt;init&lt;/code&gt;, &lt;code&gt;make&lt;/code&gt; and the likes over 20% faster). The mutable state part from the synthetic business entity update benchmark becomes 32% faster.&lt;/p&gt;&lt;h2&gt; How the write barrier works&lt;/h2&gt;&lt;p&gt;&lt;code&gt;caml_modify&lt;/code&gt; needs to record the references from values in the major heap to the minor heap. The code is straightforward:&lt;/p&gt;&lt;pre&gt;#define Modify(fp, val) do{                                                 \
  value _old_ = *(fp);                                                      \
  *(fp) = (val);                                                            \
  if (Is_in_heap (fp)){                                                     \
    if (caml_gc_phase == Phase_mark) caml_darken (_old_, NULL);             \
    if (Is_block (val) &amp;amp;&amp;amp; Is_young (val)                                    \
        &amp;amp;&amp;amp; ! (Is_block (_old_) &amp;amp;&amp;amp; Is_young (_old_))){                       \
      if (caml_ref_table.ptr &amp;gt;= caml_ref_table.limit){                      \
        CAMLassert (caml_ref_table.ptr == caml_ref_table.limit);            \
        caml_realloc_ref_table (&amp;amp;caml_ref_table);                           \
      }                                                                     \
      *caml_ref_table.ptr++ = (fp);                                         \
    }                                                                       \
  }                                                                         \
}while(0)
&lt;/pre&gt;&lt;p&gt;&lt;code&gt;fp&lt;/code&gt; is the block slot where the (suspected) pointer is to be stored. &lt;code&gt;caml_modify&lt;/code&gt; first checks if the destination is in the major heap with the &lt;code&gt;Is_in_heap&lt;/code&gt; macro I'll come back to in a minute. &lt;code&gt;Is_block(v)&lt;/code&gt; is true when &lt;code&gt;v&lt;/code&gt; is not an immediate value. There's an interesting piece of logic that tries to avoid recording the same reference twice: if the previous reference &lt;code&gt;_old_&lt;/code&gt; held in &lt;code&gt;fp&lt;/code&gt; was to a block in the minor heap, we know it's been already added to the ref table and there's no need to do it again. In the worst case, when updating &lt;code&gt;N&lt;/code&gt; times the same slot with values in the minor and the major heap alternatingly, the reference will be recorded &lt;code&gt;N/2&lt;/code&gt; times --- it seems very unlikely this would happen in practice, though, as the ref table is cleared on each minor GC run.&lt;/p&gt;&lt;p&gt;The main problem with the above routine is that &lt;code&gt;Is_in_heap&lt;/code&gt; is fairly slow. On 64-bit architectures it involves a lookup in a sparse hash table to see if the corresponding page is being managed by OCaml (on 32 bit, a two-level array is used). The reason why this check cannot just be implemented as a negated &amp;quot;is it in the minor heap&amp;quot; test (two pointer comparisons at most) is that it is possible to have blocks which belong neither to the minor heap nor to the major one --- iow., memory areas not managed by OCaml at all, and which are not to be traversed by the GC.&lt;/p&gt;&lt;h2&gt; Speeding it up&lt;/h2&gt;&lt;p&gt;When we're not in the &lt;code&gt;Phase_mark&lt;/code&gt; GC phase, the order of the checks can be altered so the expensive &lt;code&gt;Is_in_heap&lt;/code&gt; test is only performed when the value to be stored resides in the minor heap &lt;em&gt;and&lt;/em&gt; the old one didn't before.  This way, when the same slot is being updated several times, &lt;code&gt;Is_in_heap&lt;/code&gt; will only be used once (with the exception explained above).&lt;/p&gt;&lt;p&gt;The resulting code, albeit quite ugly because of the code duplication, is considerably faster. I measure the gain with two trivial programs:&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;http://eigenclass.org/R2/writings/optimizing-caml_modify&quot;&gt;Read more...&lt;/a&gt;&lt;/p&gt;
</description>
<pubDate>
Fri, 29 May 2009 15:16:50 +0000
</pubDate>
<guid isPermaLink="true">
http://eigenclass.org/R2/writings/optimizing-caml_modify
</guid>
</item>
<item>
<title>
When immutable data is faster: write barrier costs
</title>
<link>
http://eigenclass.org/R2/writings/write-barrier-cost
</link>
<description>
&lt;p&gt;&lt;a href=&quot;http://alaska-kamtchatka.blogspot.com/2009/05/is-immutable-data-slow.html&quot;&gt;This example&lt;/a&gt; shows that in some cases immutable data can be faster than mutable state. One reason is that language implementations with incremental, generational or concurrent GCs (a few combine all three properties at once, like HotSpot's; OCaml's is only generational and incremental) typically use write barriers which add overhead to each heap pointer store (read barriers are normally more expensive). In particular, generational GCs use write barriers to maintain the remembered set: a list of references from an older generation to a younger one.&lt;/p&gt;&lt;p&gt;The synthetic benchmark operates on &amp;quot;simple records intended to emulate typical business entities&amp;quot;:&lt;/p&gt;&lt;pre&gt;module P0 = struct
  type person = { name: string; age: int; balance: float }
end

module P1 = struct
  type person = { name: string; mutable age: int; mutable balance: float }
end
&lt;/pre&gt;&lt;p&gt;The &lt;code&gt;balance&lt;/code&gt; field is updated repeatedly, by creating a new record in the immutable case, and by changing the field directly in the mutable one. &lt;code&gt;float&lt;/code&gt;s are generally boxed in OCaml (the main exceptions are float arrays and records holding only floats), so a &lt;code&gt;person&lt;/code&gt; record takes 6 words on x86-64 (7 on x86, since the double value takes 8 bytes). In the first case, each iteration allocates 48(28) bytes; in the second one, only 16(12) are allocated, but the field update incurs into the write barrier overhead: a call to the &lt;code&gt;caml_modify&lt;/code&gt; routine, that checks whether a pointer to a value in the younger generation is being stored in a block residing in the old (major) heap. As it turns out, the extra allocation is faster than the write barrier:&lt;/p&gt;&lt;pre&gt;Immutable record: 138644594.442718 updates/s
Mutable record:   79740984.161326 updates/s
&lt;/pre&gt;&lt;p&gt;The example can be modified to illustrate clearly the effect of the write barrier, and when the latter is needed at all.&lt;/p&gt;&lt;p&gt;If the &lt;code&gt;balance&lt;/code&gt; field is changed to an &lt;code&gt;int&lt;/code&gt;, the stored value is known not to reside in the minor heap (integers, bools, chars and constant constructors are immediate), so the write barrier is not needed, and no call to &lt;code&gt;caml_modify&lt;/code&gt; will be generated. Moreover, there will be no allocation at all in the mutable case, since there are no boxed floats involved anymore, so we can compare the cost of allocating a 4-word record to that of updating a single field:&lt;/p&gt;&lt;pre&gt;Immutable record: 277762346.536304 updates/s
Mutable record:   499975001.249938 updates/s
&lt;/pre&gt;&lt;p&gt;Similarly, the &lt;code&gt;balance&lt;/code&gt; field can be &amp;quot;boxed manually&amp;quot;, so to speak, to avoid allocations in the mutable case, by making it of type&lt;/p&gt;&lt;pre&gt;type ufloat = { mutable value : float }
&lt;/pre&gt;&lt;p&gt;This takes exactly as much space as a regular boxed float would, but allows to update the value without any allocation in the mutable case:&lt;/p&gt;&lt;pre&gt;Immutable record: 147050173.519205 updates/s
Mutable record:   277762346.536304 updates/s
&lt;/pre&gt;&lt;p&gt;Finally, if all other fields are removed, we are left with&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;http://eigenclass.org/R2/writings/write-barrier-cost&quot;&gt;Read more...&lt;/a&gt;&lt;/p&gt;
</description>
<pubDate>
Thu, 28 May 2009 10:49:12 +0000
</pubDate>
<guid isPermaLink="true">
http://eigenclass.org/R2/writings/write-barrier-cost
</guid>
</item>
<item>
<title>
Looking for a message broker; semantics of the RabbitMQ STOMP adapter
</title>
<link>
http://eigenclass.org/R2/writings/rabbitmq-STOMP-semantics
</link>
<description>
&lt;p&gt;Albeit under-specified and limited in functionality, &lt;a href=&quot;http://stomp.codehaus.org/Protocol&quot;&gt;STOMP&lt;/a&gt; is made attractive by its utter simplicity if you're OK with basic message queues. Yet, there are &lt;a href=&quot;http://stomp.codehaus.org/#Home-StompBrokers&quot;&gt;very few usable STOMP brokers&lt;/a&gt;. Indeed, the &lt;a href=&quot;http://wiki.secondlife.com/wiki/Message_Queue_Evaluation_Notes&quot;&gt;list seems to boil down to&lt;/a&gt; RabbitMQ. (The dissonance between ActiveMQ's overall &amp;quot;enterprise-ness&amp;quot; and its inability to manage more than a few hundred queues at once is remarkable.) I'm not especially attached to STOMP or RabbitMQ, but it's the simplest way I've found so far to get what I want:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;persistent queues&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;topics (messages broadcast to all subscribers)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;blocking receive (no polling)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;message acknowledgement&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;scaling over many queues/topics&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;simplicity! Self-contained executable with small footprint preferable to huge ball of mud.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;optional NACK and message locking (delivery of message unless ACKed within a given period)&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;RabbitMQ + STOMP satisfies all but the (two?) last item(s). I have yet to determine if it's possible to configure queues so that messages are considered dropped (and thus redelivered) if not acknowledged within a specified period of time. Is this doable in RabbitMQ (configuring timeout periods externally is OK)?&lt;/p&gt;&lt;p&gt;I read somewhere that RabbitMQ took some ~5KLoC of Erlang, so it shouldn't take much more than maybe 2KLoC or so worth of OCaml to implement the above, since I'm leaving out:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;scaling over thousands of clients&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;extremely high message rates (thousands per second is amply sufficient)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;queue replication&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;For now, I'm giving RabbitMQ a try.  Unfortunately, &lt;a href=&quot;http://www.lshift.net/blog/2008/02/02/how-to-run-rabbitmqs-experimental-stomp-adapter&quot;&gt;RabbitMQ's STOMP adapter&lt;/a&gt; is also &lt;a href=&quot;https://dev.rabbitmq.com/wiki/StompGateway&quot;&gt;under-documented&lt;/a&gt;, so I had to discover most of the semantics by trial and error. Here's what I found.&lt;/p&gt;&lt;h2&gt; Frames&lt;/h2&gt;&lt;p&gt;Frames are delimited by a NUL character (&lt;code&gt;\000&lt;/code&gt;), &lt;b&gt;not&lt;/b&gt; a NUL character followed by a newline (&lt;code&gt;\000\n&lt;/code&gt;), unlike ActiveMQ. Headers used to be of the form &lt;code&gt;key:value&lt;/code&gt; (with no space), but the latest version trims leading spaces.&lt;/p&gt;&lt;h2&gt; Queues&lt;/h2&gt;&lt;p&gt;Queues can have the &amp;quot;auto-delete&amp;quot; property, which means that the queue (and thus all its messages) will disappear when the last client subscribed to it unsubscribes or disconnects. Queues are automatically created the first time they are subscribed to. This is how you create/subscribe to a persistent queue:&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;http://eigenclass.org/R2/writings/rabbitmq-STOMP-semantics&quot;&gt;Read more...&lt;/a&gt;&lt;/p&gt;
</description>
<pubDate>
Tue, 19 May 2009 10:58:16 +0000
</pubDate>
<guid isPermaLink="true">
http://eigenclass.org/R2/writings/rabbitmq-STOMP-semantics
</guid>
</item>
<item>
<title>
Markdown redux, and walking the AST to generate (statically) valid XHTML
</title>
<link>
http://eigenclass.org/R2/writings/markdown-redux-html-generation
</link>
<description>
&lt;p&gt;My earlier post (about &lt;a href=&quot;http://eigenclass.org/R2/writings/fast-extensible-simplified-markdown-in-ocaml&quot;&gt;Markdown processors and how simple top-down parsers&lt;/a&gt; can be up to two or three orders of magnitude faster than regexp-based gsub kludges) got unexpectedly popular.  Some critics accused me of deliberately bastardizing Markdown's syntax for the sake of implementation convenience and speed. In order to prove that this is not the case, I reverted the changes relative to Markdown I introduced with ease of writing in mind and measured again. The exact changes are listed below. According to &lt;code&gt;git&lt;/code&gt;, the &lt;a href=&quot;http://github.com/mfp/ocsiblog/blob/832503a70e45ad9a2e5a688c87014e570398f081/simple_markup.ml&quot;&gt;resulting code&lt;/a&gt; is in fact &lt;em&gt;shorter&lt;/em&gt;:&lt;/p&gt;&lt;pre&gt;$ git diff --stat -b 298c05 simple_markup.ml
 simple_markup.ml |  167 +++++++++++++++++++++++++++---------------------------
 1 files changed, 83 insertions(+), 84 deletions(-)
&lt;/pre&gt;&lt;p&gt;I think this, if anything, shows that the motivation behind my changes &lt;em&gt;was not&lt;/em&gt; to simplify the implementation and make it shorter.  As for the performance, the 330KB document I was using is processed in 50ms (vs. 63ms for &lt;code&gt;discount&lt;/code&gt;, 2.7s for Pandoc, 2.1s for &lt;code&gt;python-markdown&lt;/code&gt; or 30s for Bluecloth).&lt;/p&gt;&lt;h2&gt; Transforming the AST into (statically) valid XHTML&lt;/h2&gt;&lt;p&gt;As I said in my earlier post, I quickly recognized the advantages of going beyond mere &amp;quot;markup -&amp;gt; HTML&amp;quot; conversion and being able to manipulate the AST. New targets can be supported in a &lt;em&gt;very&lt;/em&gt; compact way; for instance, (X)HTML generation takes only 40 lines of code. The code is clear and concise thanks to the use of pattern matching (it's also fast :). Moreover, I can use higher-order functions to allow for specific processing of &amp;quot;interesting&amp;quot; elements. For instance, I'm rendering pre-formatted blocks, links and images differently depending on whether the markup comes from a comment or was written by me (e.g. some extensions are allowed only in the latter case, and only the &lt;code&gt;alt&lt;/code&gt; attribute is shown in the former).&lt;/p&gt;&lt;p&gt;The code below simply generates a typed HTML tree from the markup tree. I'm using &lt;a href=&quot;http://ocsigen.org&quot;&gt;Ocsigen&lt;/a&gt;'s &lt;code&gt;XHTML.M&lt;/code&gt; module, which employs OCaml's type system to ensure that the resulting HTML is valid. Invalid markup just doesn't compile, as you'd get a type error; e.g., if I try to express something invalid like &lt;code&gt;&amp;lt;b&amp;gt;&amp;lt;li&amp;gt;foo&amp;lt;/li&amp;gt;&amp;lt;/b&amp;gt;&lt;/code&gt;, the type system will catch it and tell me the acceptable elements:&lt;/p&gt;&lt;pre&gt;# b [li [pcdata &amp;quot;foo&amp;quot;]];;
     -----------------
Error: This expression has type ([&amp;gt; `Li ] as 'a) elt = 'a XHTML.M.elt
       but is here used with type
        [&amp;lt; `A | `Abbr | `Acronym | `B | `Bdo | `Big | `Br | `Button
        | `Cite | `Code | `Del | `Dfn | `Em | `I | `Img | `Input | `Ins
        | `Kbd | `Label | `Map | `Noscript | `Object | `PCDATA | `Q
        | `Samp | `Script | `Select | `Small | `Span | `Strong | `Sub
        | `Sup | `Textarea | `Tt | `Var ] elt = 'b XHTML.M.elt
       The second variant type does not allow tag(s) `Li
&lt;/pre&gt;&lt;p&gt;The compiler (or toplevel) not only tells me the HTML would not be valid (&lt;code&gt;li&lt;/code&gt; not allowed inside &lt;code&gt;b&lt;/code&gt;), but also lists the elements I can use there.&lt;/p&gt;&lt;p&gt;Here are the ~40 lines of code that generate the HTML on this site (the resulting tree is included in a larger, typed tree --- ensuring the end result is valid when the page is composed). There are no type annotations whatsoever, as OCaml is able to infer everything here (OTOH, I can get the type of any single expression with a couple keypresses in vim; you can also generate the interface with the type of all the functions with &lt;code&gt;ocamlc -i&lt;/code&gt;):&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;http://eigenclass.org/R2/writings/markdown-redux-html-generation&quot;&gt;Read more...&lt;/a&gt;&lt;/p&gt;
</description>
<pubDate>
Sun, 12 Apr 2009 20:19:24 +0000
</pubDate>
<guid isPermaLink="true">
http://eigenclass.org/R2/writings/markdown-redux-html-generation
</guid>
</item>
<item>
<title>
On the sad state of markdown processors, and getting thousandfold speed-ups.
</title>
<link>
http://eigenclass.org/R2/writings/fast-extensible-simplified-markdown-in-ocaml
</link>
<description>
&lt;p&gt;When I started to write the code for the latest incarnation of eigenclass, I was planning to use an existent Markdown processor to generate the HTML for the posts and comments dynamically. That'd take at most a couple lines to pipe the markup to a process and read back the HTML. I took the first Markdown implementation that came to mind, Bluecloth (written in Ruby) and ran it against a few documents. I was most underwhelmed by its speed. It was so slow it'd need over one or two seconds to process some of the entries I've written since.  I benchmarked other common implementations, the original &lt;code&gt;markdown&lt;/code&gt; (in Perl) and &lt;code&gt;python-markdown&lt;/code&gt;, and realized that they were only marginally better.  At the risk of being perceived as performance-obsessed, here's the observed performance when processing &lt;code&gt;markdown&lt;/code&gt;'s &lt;code&gt;README&lt;/code&gt; (&lt;code&gt;README.n&lt;/code&gt; is &lt;code&gt;README&lt;/code&gt; concatenated &lt;code&gt;n&lt;/code&gt; times) on a 3GHZ AMD64 box (much faster than the old server running this site):&lt;/p&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;language&lt;/th&gt;
&lt;th&gt;LoCs (approx.)&lt;/th&gt;
&lt;th&gt;README.1 time&lt;/th&gt;
&lt;th&gt;README.8 time&lt;/th&gt;
&lt;th&gt;README.32 time&lt;/th&gt;
&lt;th&gt;README.32 MEM&lt;/th&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Bluecloth&lt;/td&gt;
&lt;td&gt;Ruby&lt;/td&gt;
&lt;td&gt;1100&lt;/td&gt;
&lt;td&gt;0.130s&lt;/td&gt;
&lt;td&gt;2.16s&lt;/td&gt;
&lt;td&gt;30s&lt;/td&gt;
&lt;td&gt;31MB&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;markdown&lt;/td&gt;
&lt;td&gt;Perl&lt;/td&gt;
&lt;td&gt;1400&lt;/td&gt;
&lt;td&gt;0.068s&lt;/td&gt;
&lt;td&gt;0.66s&lt;/td&gt;
&lt;td&gt;segfault&lt;/td&gt;
&lt;td&gt;segfault&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;python-markdown&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;1900&lt;/td&gt;
&lt;td&gt;0.090s&lt;/td&gt;
&lt;td&gt;0.35s&lt;/td&gt;
&lt;td&gt;2.06s&lt;/td&gt;
&lt;td&gt;23MB&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Pandoc&lt;/td&gt;
&lt;td&gt;Haskell&lt;/td&gt;
&lt;td&gt;900 + 450&lt;/td&gt;
&lt;td&gt;0.068s&lt;/td&gt;
&lt;td&gt;0.55s&lt;/td&gt;
&lt;td&gt;2.7s&lt;/td&gt;
&lt;td&gt;25MB&lt;/td&gt;
&lt;/tr&gt;

&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Compare to the rather acceptable performance of my own &lt;a href=&quot;http://github.com/mfp/ocsiblog/blob/298c05b845016ce3b367efbed71aaa0849101ded/simple_markup.ml&quot;&gt;Simple_markup module in OCaml&lt;/a&gt;, and of &lt;code&gt;discount&lt;/code&gt;, a C implementation I found when I had already written mine:&lt;/p&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;language&lt;/th&gt;
&lt;th&gt;LoCs (approx.)&lt;/th&gt;
&lt;th&gt;README.8 time&lt;/th&gt;
&lt;th&gt;README.32 time&lt;/th&gt;
&lt;th&gt;README.32 MEM&lt;/th&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Simple_markup&lt;/td&gt;
&lt;td&gt;OCaml&lt;/td&gt;
&lt;td&gt;313 + 55&lt;/td&gt;
&lt;td&gt;12ms&lt;/td&gt;
&lt;td&gt;43ms&lt;/td&gt;
&lt;td&gt;3.5MB&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;discount&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;~4500&lt;/td&gt;
&lt;td&gt;16ms&lt;/td&gt;
&lt;td&gt;63ms&lt;/td&gt;
&lt;td&gt;2.8MB&lt;/td&gt;
&lt;/tr&gt;

&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;(The LoC counts for Simple_markup and Pandoc are split into parsing and HTML generation.)&lt;/p&gt;&lt;p&gt;I didn't do any attempt to optimize Simple_markup beyond replacing a single O(n^2) call to &lt;code&gt;String.nsplit&lt;/code&gt; with a O(n) &lt;code&gt;Str.split&lt;/code&gt; one in order to split the input string into lines. I'm not compiling with &lt;code&gt;-unsafe&lt;/code&gt; or &lt;code&gt;-nodynlink&lt;/code&gt; either.&lt;/p&gt;&lt;p&gt;To add insult to injury, Bluecloth, &lt;code&gt;markdown&lt;/code&gt; and python-markdown are ugly hacks that boil down to iterated regexp-based &lt;code&gt;gsub&lt;/code&gt;s. I can see why they have a long history of bugs: it is easy for a gsub pass to interfere accidentally with another, and such regexp-based transformations are full of corner cases.&lt;/p&gt;&lt;p&gt;A much cleaner approach is to parse the markup into a parse tree, and then generate the (X)HTML in a separate pass. This is what Pandoc, discount and Simple_markup do.&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;http://eigenclass.org/R2/writings/fast-extensible-simplified-markdown-in-ocaml&quot;&gt;Read more...&lt;/a&gt;&lt;/p&gt;
</description>
<pubDate>
Tue, 07 Apr 2009 10:04:19 +0000
</pubDate>
<guid isPermaLink="true">
http://eigenclass.org/R2/writings/fast-extensible-simplified-markdown-in-ocaml
</guid>
</item>
<item>
<title>
MiniLight renderer cleanup, simpler (safer) code is faster
</title>
<link>
http://eigenclass.org/R2/writings/minilight-cleanup
</link>
<description>
&lt;p&gt;&lt;a href=&quot;http://www.hxa.name/minilight/&quot;&gt;MiniLight&lt;/a&gt; is a minimal global illumination renderer that has been translated to 7 languages and takes on average 570 lines of code. Neat! It's of course accompanied by some benchmarks.  The results are unsurprising: C++ is the largest (951 lines) and fastest, followed by OCaml (461 LoC, 2.7X slower on x86 --- this is relevant because it's faster on AMD64) and Scala (427 LoC, 6.7X slower). The slowest versions are unsurprisingly Ruby (498 LoC, &lt;em&gt;575X slower&lt;/em&gt;) and Python (490 LoC, 173X slower); these languages have a hard time here, not being well-suited for numerical nor symbolic processing (both in terms of &lt;a href=&quot;http://www.reddit.com/r/programming/comments/86wko/minilight_a_minimal_global_illumination_renderer/c08fdze&quot;&gt;implementation effort&lt;/a&gt; and efficiency).&lt;/p&gt;&lt;p&gt;I took a look at the OCaml code and couldn't resist the temptation to clean it up a bit (I did &lt;em&gt;not&lt;/em&gt; optimize it thoroughly, though, so there's still ample optimization potential --- the same can be said of all implementations, I suspect). You can find the &lt;a href=&quot;http://github.com/mfp/ocaml-minilight/&quot;&gt;modified MiniLight code here&lt;/a&gt;. I have only polished some parts and there are still several things I'd write differently. There are also some style choices which I think are objectively wrong (confusing indentation), and I don't see the point of documenting the types in comments: OCaml's &lt;code&gt;ocamldoc&lt;/code&gt; tool is able to infer and present them along with the rest of the documentation automatically, and I can get the type of any expression with a couple keypresses in vim...&lt;/p&gt;&lt;p&gt;I essentially massaged the parts that screamed for a cleanup (some bad practices explained below), making the code 17 lines shorter in the process and the program substantially faster. I then realized that whereas the C++ version was using a very fast custom PRNG, the OCaml translation utilized the stdlib one, with an appreciable effect on the speed:&lt;/p&gt;&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;LoCs&lt;/th&gt;
&lt;th&gt;smits-large processing time&lt;/th&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;C++&lt;/td&gt;
&lt;td&gt;951&lt;/td&gt;
&lt;td&gt;15.6&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;OCaml orig&lt;/td&gt;
&lt;td&gt;461&lt;/td&gt;
&lt;td&gt;24.5&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;OCaml, idiomatic pattern matching&lt;/td&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;15.8&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;OCaml, simpler record type for triangle&lt;/td&gt;
&lt;td&gt;444&lt;/td&gt;
&lt;td&gt;14.7&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;OCaml, custom PRNG&lt;/td&gt;
&lt;td&gt;452&lt;/td&gt;
&lt;td&gt;11.6&lt;/td&gt;
&lt;/tr&gt;

&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;(AMD64, GCC 4.3.2, OCaml 3.11.0 with -nodynlink)&lt;/p&gt;&lt;p&gt;By using more idiomatic pattern matching and option types, the OCaml version goes from being 57% slower to just as fast as the C++ one, with fewer lines of code and more static safety (see below). With the same PRNG as the C++, the OCaml version is now 34% faster than the C++ one. Both admit doubtlessly further optimizations. Algorithmic ones will be easier to express in OCaml, but C++ will have the edge in terms of low-level code optimization.&lt;/p&gt;&lt;h2&gt; Anti-pattern: &lt;code&gt;if&lt;/code&gt; instead of pattern matching&lt;/h2&gt;&lt;p&gt;Most of the speed increase was brought by moving from ugly &lt;code&gt;if&lt;/code&gt; expressions to idiomatic pattern matching. The raytracer had code like&lt;/p&gt;&lt;pre&gt;let emissionIn =
   if (Triangle.Null = hitObject) || ((Triangle.Obj emitter) = hitObject) then
     ...
   else
     vZero
in
&lt;/pre&gt;&lt;p&gt;This feels wrong: when you have a variant type, you definitely want to pattern match, not to use &lt;code&gt;if&lt;/code&gt;. As it turns out, the above code is also slower than pattern matching for two reasons:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;code&gt;Triangle.Obj emitter&lt;/code&gt; allocates a block in the heap that will be dead right away&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;it uses structural comparison (meaning that the object/record fields are compared one by one) instead of identity comparison&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;This is a simpler, more idiomatic way to write the above, which is also much faster (I use the standard option type instead of a new one that serves no purpose):&lt;/p&gt;&lt;pre&gt;let emissionIn = match hitObject with
    Some emitter' when emitter' != emitter -&amp;gt; vZero
  | _ -&amp;gt; ...
&lt;/pre&gt;&lt;p&gt;It's easier to read too, as you don't have to analyze two boolean expressions, but only one.&lt;/p&gt;&lt;h2&gt; Anti-pattern: tuples with boolean flags instead of option types&lt;/h2&gt;&lt;p&gt;&lt;a href=&quot;http://eigenclass.org/R2/writings/minilight-cleanup&quot;&gt;Read more...&lt;/a&gt;&lt;/p&gt;
</description>
<pubDate>
Mon, 30 Mar 2009 11:00:37 +0000
</pubDate>
<guid isPermaLink="true">
http://eigenclass.org/R2/writings/minilight-cleanup
</guid>
</item>
</channel>
</rss>
