When I started to write the code for the latest incarnation of eigenclass, I was planning to use an existent Markdown processor to generate the HTML for the posts and comments dynamically. That'd take at most a couple lines to pipe the markup to a process and read back the HTML. I took the first Markdown implementation that came to mind, Bluecloth (written in Ruby) and ran it against a few documents. I was most underwhelmed by its speed. It was so slow it'd need over one or two seconds to process some of the entries I've written since. I benchmarked other common implementations, the original markdown (in Perl) and python-markdown, and realized that they were only marginally better. At the risk of being perceived as performance-obsessed, here's the observed performance when processing markdown's README (README.n is README concatenated n times) on a 3GHZ AMD64 box (much faster than the old server running this site):

language LoCs (approx.) README.1 time README.8 time README.32 time README.32 MEM
Bluecloth Ruby 1100 0.130s 2.16s 30s 31MB
markdown Perl 1400 0.068s 0.66s segfault segfault
python-markdown Python 1900 0.090s 0.35s 2.06s 23MB
Pandoc Haskell 900 + 450 0.068s 0.55s 2.7s 25MB

Compare to the rather acceptable performance of my own Simple_markup module in OCaml, and of discount, a C implementation I found when I had already written mine:

language LoCs (approx.) README.8 time README.32 time README.32 MEM
Simple_markup OCaml 313 + 55 12ms 43ms 3.5MB
discount C ~4500 16ms 63ms 2.8MB

(The LoC counts for Simple_markup and Pandoc are split into parsing and HTML generation.)

I didn't do any attempt to optimize Simple_markup beyond replacing a single O(n^2) call to String.nsplit with a O(n) Str.split one in order to split the input string into lines. I'm not compiling with -unsafe or -nodynlink either.

To add insult to injury, Bluecloth, markdown and python-markdown are ugly hacks that boil down to iterated regexp-based gsubs. I can see why they have a long history of bugs: it is easy for a gsub pass to interfere accidentally with another, and such regexp-based transformations are full of corner cases.

A much cleaner approach is to parse the markup into a parse tree, and then generate the (X)HTML in a separate pass. This is what Pandoc, discount and Simple_markup do.

Recognized markup

Simple_markup deviates from Markdown in a few ways:

Some comments about the implementation

The bulk of the code parses a stream of lines (I use Extlib/Batteries' Enum) into a list of paragraphs, defined as:

type paragraph =
    Normal of par_text
  | Pre of string * string option
  | Heading of int * par_text
  | Quote of paragraph list
  | Ulist of paragraph list * paragraph list list
  | Olist of paragraph list * paragraph list list

and par_text = text list

and text =
    Text of string
  | Emph of string
  | Bold of string
  | Struck of par_text
  | Code of string
  | Link of href
  | Anchor of string
  | Image of img_ref

and href = { href_target : string; href_desc : string; }

and img_ref = { img_src : string; img_alt : string; }

The Ulist and Olist constructors take the first item followed by a (possible empty) list of items, to prevent empty lists --- this way, there's at least one element.

Note that Emph and Bold don't contain more (styled) text, but only strings. It'd be rather easy to allow this by using polymorphic variants, as follows:

and unstyled_text = [`Text of string]
and simple_text = [unstyled_text | `Emph of simple_text
                  |`Bold of simple_text | `Struck of simple_text]
and text = [ simple_text | `Code of string | ... ]

The same goes for styled text inside links. I chose not to do it, though, because bold emphasized text is just poor style.

The top-down parser takes 313 LoC and is not very complex, in spite of markdown's use of significant indentation. At its core, it's just a set of recursive functions that take an enumeration of (int * string * bool) tuples with the indentation, the string and whether the line is empty, for convenience:

let parse_enum e =
  collect (read_paragraph 0)
    (Enum.map (fun l -> let l' = String.strip l in (indentation l, l', l' = "")) e)

where read_paragraph indent returns Some paragraph or None at EOF or when the indentation level is less than indent, and

let collect f x =
  let rec loop acc = match f x with
      None -> List.rev acc
    | Some y -> loop (y :: acc)
  in loop []

Since the parser operates on an enumeration, it can work incrementally, reading a line from disk at a time, if wanted. As an aside, it seems to me that this is one of the two most important use cases for laziness (the other being, well, computations performed on demand, which are easy to do with OCaml's lazy), and it is perfectly satisfied by Enum.

The rest of the code is very top-down:

let rec read_paragraph ?(skip_blank=true) indent e = match Enum.peek e with
    None -> None
  | Some (indentation, line, isblank) -> match isblank with
        true ->
          Enum.junk e;
          if skip_blank then read_paragraph indent e else None
      | false ->
          if indentation < indent then
            None
          else begin
            Enum.junk e;
            read_nonempty indentation e line
          end

and read_nonempty indent e s = match s.[0] with
    '!' -> read_heading s
  | '*' when snd_is_space s -> push_remainder indent s e; read_ul indent e
  | '#' when snd_is_space s -> push_remainder indent s e; read_ol indent e
  | '{' when snd_is s '{' -> read_pre (String.slice s ~first:2) e
  | '>' when snd_is_space s || s = ">" ->
      (* last check needed because "> " becomes ">" *)
      Enum.push e (indent, s, false); read_quote indent e
  | _ -> Enum.push e (indent, s, false); read_normal e

and read_heading s =
  let s' = String.strip ~chars:"!" s in
  let level = String.length s - String.length s' in
    Some (Heading (level, parse_text s'))

and read_ul indent e =
  read_list
    (fun fst others -> Ulist (fst, others))
    (fun s -> snd_is_space s && s.[0] = '*')
    indent e

...

Comments

  1. test

    tehst
    }}}
    
    *sets*
    
    #1 test
    #2 test
    
    the, 07 April 2009 at 13:11#
  2. the

    the

    the

    the?, 07 April 2009 at 13:11#
  3. For those of us who think this code is awesome enough for production use, what is it license?

    Kyle, 07 April 2009 at 13:36#
  4. I can only agree with you: these common markup languages are easily really bad. We tried textile for savonet's website, since it seemed to be standardized, but quickly realized that it was only defined by its broken (and slow) implementation.

    We ended up quickly hacking a simple markup language in ocaml. The particularity of our approach is that we used ocamllex/yacc to generate the AST, with a simple preprocessing function between the two. It gives a fast and strict implementation.

    mrpingouin, 07 April 2009 at 13:41#
  5. Nice comparison. And I like the continuing "OCaml can do it like this ..." theme. ;)

    Did you do any comparisons with some of the other ruby markdown processors? I believe Maruku is a bit faster than Bluecloth, but I'm not sure if it holds up on large documents. I have never looked at its internals in any detail. I remember liking Maruku's easy extension capabilities.

    I believe there is also a ruby extension to discount, as well one to PEG. Ah yes, Tomayko covered them on his blog: http://tomayko.com/writings/ruby-markdown-libraries-real-cheap-for-you-two-for-price-of-one (sure, extensions might be a bit off-topic here. but I've always been a fan of how easy it is to tie Ruby to C.)

    Cameron, 07 April 2009 at 13:49#
  6. Why do you think README.n is a good file to do the comparison?

    And besides, don't you state simple_markup is doing something different than the original markup? Aren't you comparing apples with oranges then?

    However, I agree, gsub() all over the place is not a good design decision (most of the time).

    Andreas, 07 April 2009 at 13:56#
  7. Pretty cool.

    Just a heads up, if others are looking for a super fast Ruby BlueCloth replacement check out rdiscount. Much faster and it's 100% Markdown 1.0 compatible.

    ProbablyCorey, 07 April 2009 at 14:30#
  8. Why are your headers in this post indented over like that?

    Weird

    anonymous, 07 April 2009 at 14:55#
  9. I've been pretty happy with Maruku. Haven't done any benchmarks, but I haven't noticed it being slow, even on large documents. Also supports the PHP-extra extensions, for tables and definition lists, as well as some of its own, which can set html attributes on block-level elements.

    Paul, 07 April 2009 at 15:22#
  10. Markdown's syntax is good because the plain-text version still looks like plain-text. Not code. I'm sure Simple_markup reduces syntactical ambiguity, but at the cost of making the plaintext much less aesthetic.

    anony, 07 April 2009 at 15:31#
  11. Kyle: MIT license.

    Cameron: while the initial motivation to write my own markdown-ish processor was to get acceptable performance, I quickly realized that the ability to operate on the parse tree was essential; doing it in OCaml (which runs this site) is much easier than working with discount's structures in C. A mere "to_html" is not flexible enough, I want to process the parse tree.

    Andreas, anony: Simple_markup is close enough to Markdown to make the comparison meaningful, I think. The most important deviation is the use of {{ for pre-formatted blocks; other than that, it's pretty much like Markdown, and not much less aesthetic IMO, unless you think that __foo__ is much worse than _foo_ (markdown recognizes both IIRC, and my code also used the latter, originally, but I changed to the former after I saw identifiers like some_var causing problems in my reddit comments; I often forgot to escape such uses).

    Why isn't Simple_markup 100% "Markdown compliant" (if such a thing exists, as Markdown is ill-defined and buggy)? Simply because it scratches my personal itch; I just wanted some markup engine for my site and did minor adjustments to the language that made sense to me. I think this is not an apple-to-orange situation because the markup languages share an important subset (pretty much everything but pre blocks and minor differences here and there), and my code could be easily adapted to be fully "Markdown compliant", with no substantial differences in processing speed or code size. It's just that the Simple_markup language works a bit better for me.

    As for the choice of README.n: it's just the first Markdown document I got hold of when I needed to time the processors. It's maybe a bit more list-heavy than an average document, but not utterly unrealistic --- it's a real document, not a synthetic example.

    ProbablyCorey: yup, rdiscount builds on discount and is thus fast. The problem I have with it (the reason why I didn't scrap my code) is that it seems to be limited to markdown.to_html, and there's no easy way to manipulate the parse tree the way I'm doing here.

    mfp, 07 April 2009 at 16:30#
  12. "my code could be easily adapted to be fully 'Markdown compliant', with no substantial differences in processing speed or code size."

    OK, I throw down the gauntlet. Get your code to pass the markdown test suite, and then the performance comparisons will be meaningful.

    crank, 07 April 2009 at 17:05#
  13. You might be interested in my peg-markdown (written in C). It creates a parse tree and currently supports output in HTML, LaTeX, and groff. New output formats can be added very easily, and it is also easy to modify the markup syntax, since it is based on a PEG grammar.

    John MacFarlane, 07 April 2009 at 17:12#
  14. My first thought was to do it in Ragel, the Grammar -> FSM -> C compiler. It turns out someone did that for RedCloth and it's linked off the Ragel homepage:

    http://redhanded.hobix.com/inspect/superredclothAsAChild.html http://www.complang.org/ragel/

    ThirteenNoTrump, 07 April 2009 at 18:13#
  15. Very nice!

    Jon Harrop, 07 April 2009 at 20:26#
  16. ThirteenNoTrump,

    Apparently that was merged into RedCloth, as the FAQ says that RedCloth requires Ragel. So part of why BlueCloth is so slow is that the same hasn't been done for it. There are a couple of reasons why a developer might not bother to make a version of BlueCloth that uses ragel, though:

    • rdiscount is available

    • ragel requires c extension support, so it won't work on JRuby, and you might as well just use rdiscount

    Ben Atkin, 07 April 2009 at 20:43#
  17. I love the idea of being able to add extensions to code blocks. It's so simple! There's a feature I haven't added to a personal wiki because I couldn't figure out maruku's extension methods. This would take the pain out of it!

    Also, I didn't get why you wouldn't support both bold and italics being used at the same time at first. After all, you could always just decide not to use it. Then I realized, however, that it keeps commentors from using it. That is great. If only reddit would limit this abuse and others.

    Ben Atkin, 07 April 2009 at 21:01#
  18. The current version of the Ruby gem RedCloth a textile => html processor which uses Ragel to compile a FSM to C or Java is now maintained by Jason Garber and is located here:

    http://github.com/jgarber/redcloth/tree/master

    Stephen Bannasch, 07 April 2009 at 21:43#
  19. I must be missing something ...?

    Which markdown README are you processing in your benchmarks?

    The one that comes with the original markdown was too small so I found this somewhat longer one:

    http://wpcal.firetree.net/wp-content/plugins/PHP%20Markdown%201.0.1k/PHP%20Markdown%20Readme.text

    When I process that readme concatenated 32 times with the version of bluecloth on github:

    http://github.com/mislav/bluecloth/tree/master

    it takes between 0.03 - 0.04s in both MRI and JRuby

    See: http://gist.github.com/91477

    This is on a MacBook Pro w/a 2.5 GHz Intel Core Duo 2

    Is there a much larger README somewhere else?

    Stephen Bannasch, 07 April 2009 at 22:02#
    1. one

    2. two

    #1 one #2 two

    anonymous, 08 April 2009 at 01:59#
    1. one 2 two

    anonymous, 08 April 2009 at 01:59#
  20. Stephan: I guess README refers to http://daringfireball.net/projects/markdown/syntax.text .

    Anon, 08 April 2009 at 02:03#
  21. http://awk.info/?dsl/markdown It will probably be the slowest implementation, and not feature-complete, but probably the simplest one too.

    yy, 08 April 2009 at 07:25#
  22. I don't suppose you have any idea why Pandoc is so slow? I'm rather surprised that python-markdown was able to beat a compiled language.

    Kevin Ballard, 08 April 2009 at 08:21#
  23. crank: the thing is I don't want to turn this into a bug-for-bug compatible Markdown implementation because from my POV several parts of Markdown are misdesigned (keep in mind that I'm the main user of this markup variant, and the subset of it that is normally used in the comments is nearly identical to Markdown's). My claim ("making this 100% Markdown compliant wouldn't change substantially the performance or the required amount of code") is based on two arguments:

    1. discount is a constructive proof that processing Markdown markup at this speed is possible, and

    2. Simple_markup is close enough to Markdown, and incorporates the trickiest parts (significant indentation, correct nesting of blockquotes, lists, etc.), so not much code would be needed to remove the differences I introduced deliberately. It might even become shorter, as Markdown has got simpler (and inconvenient, if you ask me) rules for block/level termination (extra blank lines required).

    Remember that C and OCaml are in the same league performance-wise, worlds apart from interpreted languages like Perl, Python or Ruby.

    Ben: indeed, the idea is to prevent things like bold, emphasized links (ugh).

    Stephen: this is the README I'm referring to. Your results are starkly different because there are two bugs in your benchmark. You're timing

    n.times { BlueCloth::new(@readme) }
    

    even though the concatenated string is @input_string = @readme * n, and you're not calling the to_html method, which performs the actual conversion. The constructor doesn't do anything besides initializing some instance variables and setting the "restrictions".

    Kevin: Pandoc uses Parsec and character lists (that is, Haskell's native "string" type) instead of actual (packed) strings (ByteString). This way, a char takes at least 16 bytes (instead of one), leading to poor locality plus increased memory demands, and thus to worse performance.

    mfp, 08 April 2009 at 10:34#
  24. Ahh ... thanks.

    Now my MRI benchmark comes out the same as yours -- about 30s for the README concatenated 32 times.

    JRuby running the same bluecloth gem runs the same test in about 14s.

    I've updated my version of the benchmark with new numbers here:

    graph of results

    Stephen Bannasch, 08 April 2009 at 16:56#
  25. Stephen, the graph is interesting as it clearly shows that Bluecloth is scaling quadratically both in MRI and JRuby. I suspected it was the GC (MRI's mark & sweep is actually O(N^2) on allocated data), but this shouldn't be the case of JRuby, so it might be a regexp with nonlinear characteristics(?).

    mfp, 08 April 2009 at 18:15#
  26. I ran a somewhat comparable benchmark with v4.1.9 of RedCloth.

    Jason Garber sent me a 6k file of textile and processing 330k was around 2s in MRI and JRuby.

    The time increase was linear.

    Here's a graph of the data:

    http://img.skitch.com/20090411-5is73f52ftck2h3pyiiyjiu2m.png

    Stephen Bannasch, 11 April 2009 at 07:02#
  27. if your site is run by ocaml, an ocaml library makes a lot of sense. And I can see how access to the parse tree would be useful. I mentioned maruku for two reasons: I think it uses a more formal parser (and one can get a REXML document externally, rather than a #to_html) and I was (selfishly) curious how it measured up.

    Cameron, 11 April 2009 at 22:44#
  28. Cameron: yes, Maruku is clearly not a cheap gsub hack :). I see it's using a state machine, which is equivalent to the approach I followed in my code if you squint enough (the difference is that in my code most of the state is implicit in the call stack). Maruku is fairly large and includes several extensions (lib/maruku/input/ alone takes 2.6 KLoC), so the control flow isn't obvious at the first glance.

    I just ran it against the same files I used for the above-mentioned engines, and got these results:

    README.1   0.31s
    README.2   0.41s
    README.8   1.15s
    README.32  4.8s       56MB RSS
    

    I'm just running the supplied maruku command and measuring with time. As you can see, program initialization takes a while (loading + interpreting code, RubyGems overhead, etc.), but README.32 is large enough to make this a non-issue. Maruku is faster than Bluecloth but much more memory-hungry.

    mfp, 12 April 2009 at 14:36#
  29. Pandoc uses [Char]? Seriously? No wonder it's so slow! Why don't they just change over to ByteString? Parsec can work with that too (although from the docs, Text.Parsec.ByteString uses ByteString.Char8.readFile so it would not support handling of unicode chars, though to be fair Markdown processing shouldn't split up the bytes that make up a utf-8 character so the resulting parse should in fact still be valid utf-8 given a utf-8 source document).

    I wonder how much work it would take to switch over to ByteString?

    Kevin Ballard, 13 April 2009 at 21:55#
  30. As another thought, I wonder if a Parsec Stream instance could be written for Data.Text.Text.

    Kevin Ballard, 13 April 2009 at 21:56#
  31. Nyp9yU <a href="http://czkptmyntzxo.com/">czkptmyntzxo</a>, [url=http://svolposblwmv.com/]svolposblwmv[/url], [link=http://ynqnpqhdsfgl.com/]ynqnpqhdsfgl[/link], http://sxbztvdttamb.com/

    vrhxiqly, 29 May 2009 at 15:58#