When I started to write the code for the latest incarnation of eigenclass, I was planning to use an existent Markdown processor to generate the HTML for the posts and comments dynamically. That'd take at most a couple lines to pipe the markup to a process and read back the HTML. I took the first Markdown implementation that came to mind, Bluecloth (written in Ruby) and ran it against a few documents. I was most underwhelmed by its speed. It was so slow it'd need over one or two seconds to process some of the entries I've written since. I benchmarked other common implementations, the original markdown (in Perl) and python-markdown, and realized that they were only marginally better. At the risk of being perceived as performance-obsessed, here's the observed performance when processing markdown's README (README.n is README concatenated n times) on a 3GHZ AMD64 box (much faster than the old server running this site):
| language | LoCs (approx.) | README.1 time | README.8 time | README.32 time | README.32 MEM | |
|---|---|---|---|---|---|---|
| Bluecloth | Ruby | 1100 | 0.130s | 2.16s | 30s | 31MB |
| markdown | Perl | 1400 | 0.068s | 0.66s | segfault | segfault |
| python-markdown | Python | 1900 | 0.090s | 0.35s | 2.06s | 23MB |
| Pandoc | Haskell | 900 + 450 | 0.068s | 0.55s | 2.7s | 25MB |
Compare to the rather acceptable performance of my own Simple_markup module in OCaml, and of discount, a C implementation I found when I had already written mine:
| language | LoCs (approx.) | README.8 time | README.32 time | README.32 MEM | |
|---|---|---|---|---|---|
| Simple_markup | OCaml | 313 + 55 | 12ms | 43ms | 3.5MB |
| discount | C | ~4500 | 16ms | 63ms | 2.8MB |
(The LoC counts for Simple_markup and Pandoc are split into parsing and HTML generation.)
I didn't do any attempt to optimize Simple_markup beyond replacing a single O(n^2) call to String.nsplit with a O(n) Str.split one in order to split the input string into lines. I'm not compiling with -unsafe or -nodynlink either.
To add insult to injury, Bluecloth, markdown and python-markdown are ugly hacks that boil down to iterated regexp-based gsubs. I can see why they have a long history of bugs: it is easy for a gsub pass to interfere accidentally with another, and such regexp-based transformations are full of corner cases.
A much cleaner approach is to parse the markup into a parse tree, and then generate the (X)HTML in a separate pass. This is what Pandoc, discount and Simple_markup do.
Recognized markup
Simple_markup deviates from Markdown in a few ways:
any character can be escaped with \,
emphasis is done with __ (less prone to accidental use than _) and bold text with *,
typographical abuses like bold emphasized text are not allowed,
headers are done with
!level1 header,!!level2 header, etc.,the
#character is thus free and can be used for numbered lists, replacing1.,pre-formatted code is done with
{{ whatever }}instead of using the indentation (less error-prone), and
it is possible to extend the markup with custom processors, using
{{extension-nameblocks; for instance, raw HTML is inserted with{{html <b> whatever </b> }}(before you try to inject arbitrary HTML in the comments: this extension is only enabled in the main text).
Some comments about the implementation
The bulk of the code parses a stream of lines (I use Extlib/Batteries' Enum) into a list of paragraphs, defined as:
type paragraph =
Normal of par_text
| Pre of string * string option
| Heading of int * par_text
| Quote of paragraph list
| Ulist of paragraph list * paragraph list list
| Olist of paragraph list * paragraph list list
and par_text = text list
and text =
Text of string
| Emph of string
| Bold of string
| Struck of par_text
| Code of string
| Link of href
| Anchor of string
| Image of img_ref
and href = { href_target : string; href_desc : string; }
and img_ref = { img_src : string; img_alt : string; }
The Ulist and Olist constructors take the first item followed by a (possible empty) list of items, to prevent empty lists --- this way, there's at least one element.
Note that Emph and Bold don't contain more (styled) text, but only strings. It'd be rather easy to allow this by using polymorphic variants, as follows:
and unstyled_text = [`Text of string]
and simple_text = [unstyled_text | `Emph of simple_text
|`Bold of simple_text | `Struck of simple_text]
and text = [ simple_text | `Code of string | ... ]
The same goes for styled text inside links. I chose not to do it, though, because bold emphasized text is just poor style.
The top-down parser takes 313 LoC and is not very complex, in spite of markdown's use of significant indentation. At its core, it's just a set of recursive functions that take an enumeration of (int * string * bool) tuples with the indentation, the string and whether the line is empty, for convenience:
let parse_enum e =
collect (read_paragraph 0)
(Enum.map (fun l -> let l' = String.strip l in (indentation l, l', l' = "")) e)
where read_paragraph indent returns Some paragraph or None at EOF or when the indentation level is less than indent, and
let collect f x =
let rec loop acc = match f x with
None -> List.rev acc
| Some y -> loop (y :: acc)
in loop []
Since the parser operates on an enumeration, it can work incrementally, reading a line from disk at a time, if wanted. As an aside, it seems to me that this is one of the two most important use cases for laziness (the other being, well, computations performed on demand, which are easy to do with OCaml's lazy), and it is perfectly satisfied by Enum.
The rest of the code is very top-down:
let rec read_paragraph ?(skip_blank=true) indent e = match Enum.peek e with
None -> None
| Some (indentation, line, isblank) -> match isblank with
true ->
Enum.junk e;
if skip_blank then read_paragraph indent e else None
| false ->
if indentation < indent then
None
else begin
Enum.junk e;
read_nonempty indentation e line
end
and read_nonempty indent e s = match s.[0] with
'!' -> read_heading s
| '*' when snd_is_space s -> push_remainder indent s e; read_ul indent e
| '#' when snd_is_space s -> push_remainder indent s e; read_ol indent e
| '{' when snd_is s '{' -> read_pre (String.slice s ~first:2) e
| '>' when snd_is_space s || s = ">" ->
(* last check needed because "> " becomes ">" *)
Enum.push e (indent, s, false); read_quote indent e
| _ -> Enum.push e (indent, s, false); read_normal e
and read_heading s =
let s' = String.strip ~chars:"!" s in
let level = String.length s - String.length s' in
Some (Heading (level, parse_text s'))
and read_ul indent e =
read_list
(fun fst others -> Ulist (fst, others))
(fun s -> snd_is_space s && s.[0] = '*')
indent e
...
Comments
test
the
the
the
For those of us who think this code is awesome enough for production use, what is it license?
I can only agree with you: these common markup languages are easily really bad. We tried textile for savonet's website, since it seemed to be standardized, but quickly realized that it was only defined by its broken (and slow) implementation.
We ended up quickly hacking a simple markup language in ocaml. The particularity of our approach is that we used ocamllex/yacc to generate the AST, with a simple preprocessing function between the two. It gives a fast and strict implementation.
Nice comparison. And I like the continuing "OCaml can do it like this ..." theme. ;)
Did you do any comparisons with some of the other ruby markdown processors? I believe Maruku is a bit faster than Bluecloth, but I'm not sure if it holds up on large documents. I have never looked at its internals in any detail. I remember liking Maruku's easy extension capabilities.
I believe there is also a ruby extension to discount, as well one to PEG. Ah yes, Tomayko covered them on his blog: http://tomayko.com/writings/ruby-markdown-libraries-real-cheap-for-you-two-for-price-of-one (sure, extensions might be a bit off-topic here. but I've always been a fan of how easy it is to tie Ruby to C.)
Why do you think README.n is a good file to do the comparison?
And besides, don't you state simple_markup is doing something different than the original markup? Aren't you comparing apples with oranges then?
However, I agree, gsub() all over the place is not a good design decision (most of the time).
Pretty cool.
Just a heads up, if others are looking for a super fast Ruby BlueCloth replacement check out rdiscount. Much faster and it's 100% Markdown 1.0 compatible.
Why are your headers in this post indented over like that?
Weird
I've been pretty happy with Maruku. Haven't done any benchmarks, but I haven't noticed it being slow, even on large documents. Also supports the PHP-extra extensions, for tables and definition lists, as well as some of its own, which can set html attributes on block-level elements.
Markdown's syntax is good because the plain-text version still looks like plain-text. Not code. I'm sure Simple_markup reduces syntactical ambiguity, but at the cost of making the plaintext much less aesthetic.
Kyle: MIT license.
Cameron: while the initial motivation to write my own markdown-ish processor was to get acceptable performance, I quickly realized that the ability to operate on the parse tree was essential; doing it in OCaml (which runs this site) is much easier than working with
discount's structures in C. A mere "to_html" is not flexible enough, I want to process the parse tree.Andreas, anony: Simple_markup is close enough to Markdown to make the comparison meaningful, I think. The most important deviation is the use of
{{for pre-formatted blocks; other than that, it's pretty much like Markdown, and not much less aesthetic IMO, unless you think that__foo__is much worse than_foo_(markdownrecognizes both IIRC, and my code also used the latter, originally, but I changed to the former after I saw identifiers likesome_varcausing problems in my reddit comments; I often forgot to escape such uses).Why isn't Simple_markup 100% "Markdown compliant" (if such a thing exists, as Markdown is ill-defined and buggy)? Simply because it scratches my personal itch; I just wanted some markup engine for my site and did minor adjustments to the language that made sense to me. I think this is not an apple-to-orange situation because the markup languages share an important subset (pretty much everything but
preblocks and minor differences here and there), and my code could be easily adapted to be fully "Markdown compliant", with no substantial differences in processing speed or code size. It's just that the Simple_markup language works a bit better for me.As for the choice of README.n: it's just the first Markdown document I got hold of when I needed to time the processors. It's maybe a bit more list-heavy than an average document, but not utterly unrealistic --- it's a real document, not a synthetic example.
ProbablyCorey: yup,
rdiscountbuilds ondiscountand is thus fast. The problem I have with it (the reason why I didn't scrap my code) is that it seems to be limited tomarkdown.to_html, and there's no easy way to manipulate the parse tree the way I'm doing here."my code could be easily adapted to be fully 'Markdown compliant', with no substantial differences in processing speed or code size."
OK, I throw down the gauntlet. Get your code to pass the markdown test suite, and then the performance comparisons will be meaningful.
You might be interested in my peg-markdown (written in C). It creates a parse tree and currently supports output in HTML, LaTeX, and groff. New output formats can be added very easily, and it is also easy to modify the markup syntax, since it is based on a PEG grammar.
My first thought was to do it in Ragel, the Grammar -> FSM -> C compiler. It turns out someone did that for RedCloth and it's linked off the Ragel homepage:
http://redhanded.hobix.com/inspect/superredclothAsAChild.html http://www.complang.org/ragel/
Very nice!
ThirteenNoTrump,
Apparently that was merged into RedCloth, as the FAQ says that RedCloth requires Ragel. So part of why BlueCloth is so slow is that the same hasn't been done for it. There are a couple of reasons why a developer might not bother to make a version of BlueCloth that uses ragel, though:
rdiscount is available
ragel requires c extension support, so it won't work on JRuby, and you might as well just use rdiscount
I love the idea of being able to add extensions to code blocks. It's so simple! There's a feature I haven't added to a personal wiki because I couldn't figure out maruku's extension methods. This would take the pain out of it!
Also, I didn't get why you wouldn't support both bold and italics being used at the same time at first. After all, you could always just decide not to use it. Then I realized, however, that it keeps commentors from using it. That is great. If only reddit would limit this abuse and others.
The current version of the Ruby gem RedCloth a textile => html processor which uses Ragel to compile a FSM to C or Java is now maintained by Jason Garber and is located here:
http://github.com/jgarber/redcloth/tree/master
I must be missing something ...?
Which markdown README are you processing in your benchmarks?
The one that comes with the original markdown was too small so I found this somewhat longer one:
http://wpcal.firetree.net/wp-content/plugins/PHP%20Markdown%201.0.1k/PHP%20Markdown%20Readme.text
When I process that readme concatenated 32 times with the version of bluecloth on github:
http://github.com/mislav/bluecloth/tree/master
it takes between 0.03 - 0.04s in both MRI and JRuby
See: http://gist.github.com/91477
This is on a MacBook Pro w/a 2.5 GHz Intel Core Duo 2
Is there a much larger README somewhere else?
one
two
#1 one #2 two
one 2 two
Stephan: I guess README refers to http://daringfireball.net/projects/markdown/syntax.text .
http://awk.info/?dsl/markdown It will probably be the slowest implementation, and not feature-complete, but probably the simplest one too.
I don't suppose you have any idea why Pandoc is so slow? I'm rather surprised that python-markdown was able to beat a compiled language.
crank: the thing is I don't want to turn this into a bug-for-bug compatible Markdown implementation because from my POV several parts of Markdown are misdesigned (keep in mind that I'm the main user of this markup variant, and the subset of it that is normally used in the comments is nearly identical to Markdown's). My claim ("making this 100% Markdown compliant wouldn't change substantially the performance or the required amount of code") is based on two arguments:
discount is a constructive proof that processing Markdown markup at this speed is possible, and
Simple_markup is close enough to Markdown, and incorporates the trickiest parts (significant indentation, correct nesting of blockquotes, lists, etc.), so not much code would be needed to remove the differences I introduced deliberately. It might even become shorter, as Markdown has got simpler (and inconvenient, if you ask me) rules for block/level termination (extra blank lines required).
Remember that C and OCaml are in the same league performance-wise, worlds apart from interpreted languages like Perl, Python or Ruby.
Ben: indeed, the idea is to prevent things like bold, emphasized links (ugh).
Stephen: this is the README I'm referring to. Your results are starkly different because there are two bugs in your benchmark. You're timing
n.times { BlueCloth::new(@readme) }even though the concatenated string is
@input_string = @readme * n, and you're not calling theto_htmlmethod, which performs the actual conversion. The constructor doesn't do anything besides initializing some instance variables and setting the "restrictions".Kevin: Pandoc uses Parsec and character lists (that is, Haskell's native "string" type) instead of actual (packed) strings (ByteString). This way, a char takes at least 16 bytes (instead of one), leading to poor locality plus increased memory demands, and thus to worse performance.
Ahh ... thanks.
Now my MRI benchmark comes out the same as yours -- about 30s for the README concatenated 32 times.
JRuby running the same bluecloth gem runs the same test in about 14s.
I've updated my version of the benchmark with new numbers here:
graph of results
Stephen, the graph is interesting as it clearly shows that Bluecloth is scaling quadratically both in MRI and JRuby. I suspected it was the GC (MRI's mark & sweep is actually O(N^2) on allocated data), but this shouldn't be the case of JRuby, so it might be a regexp with nonlinear characteristics(?).
I ran a somewhat comparable benchmark with v4.1.9 of RedCloth.
Jason Garber sent me a 6k file of textile and processing 330k was around 2s in MRI and JRuby.
The time increase was linear.
Here's a graph of the data:
http://img.skitch.com/20090411-5is73f52ftck2h3pyiiyjiu2m.png
if your site is run by ocaml, an ocaml library makes a lot of sense. And I can see how access to the parse tree would be useful. I mentioned maruku for two reasons: I think it uses a more formal parser (and one can get a REXML document externally, rather than a #to_html) and I was (selfishly) curious how it measured up.
Cameron: yes, Maruku is clearly not a cheap gsub hack :). I see it's using a state machine, which is equivalent to the approach I followed in my code if you squint enough (the difference is that in my code most of the state is implicit in the call stack). Maruku is fairly large and includes several extensions (
lib/maruku/input/alone takes 2.6 KLoC), so the control flow isn't obvious at the first glance.I just ran it against the same files I used for the above-mentioned engines, and got these results:
I'm just running the supplied
marukucommand and measuring withtime. As you can see, program initialization takes a while (loading + interpreting code, RubyGems overhead, etc.), butREADME.32is large enough to make this a non-issue. Maruku is faster than Bluecloth but much more memory-hungry.Pandoc uses [Char]? Seriously? No wonder it's so slow! Why don't they just change over to ByteString? Parsec can work with that too (although from the docs, Text.Parsec.ByteString uses ByteString.Char8.readFile so it would not support handling of unicode chars, though to be fair Markdown processing shouldn't split up the bytes that make up a utf-8 character so the resulting parse should in fact still be valid utf-8 given a utf-8 source document).
I wonder how much work it would take to switch over to ByteString?
As another thought, I wonder if a Parsec Stream instance could be written for Data.Text.Text.
Nyp9yU <a href="http://czkptmyntzxo.com/">czkptmyntzxo</a>, [url=http://svolposblwmv.com/]svolposblwmv[/url], [link=http://ynqnpqhdsfgl.com/]ynqnpqhdsfgl[/link], http://sxbztvdttamb.com/