Chuck Vose's comment made me realize that the universal extprot message decoder can be simplified considerably if I simply deserialize the data and let Ruby do the pretty-printing for me (#inspect). I now have a 120 LoC universal decoder that can deserialize any message (without the original protocol definition), and can exchange data between OCaml and Ruby, the first extprot targets. But before I come to that, some clarifications on Ruby's Marshal vs. extprot are in order.

What's the point? Why use extprot instead of Marshal.dump?

Marshal has been a core Ruby class forever. It is written in C, fairly fast (it's the fastest way to (de)serialize Ruby data, at any rate), and convenient to use: you just give it an object (nearly any object), and you get a string. Give it a string, and your object's back. Why would anybody want to use anything else? In fact, there are a few reasons not to use Marshal:

  1. the format used by Marshal has changed a few times in the past. (The minor version has changed 8 times since the first release.)

  2. it's Ruby-only. AFAIK nothing else can read data serialized with Marshal.

  3. serializing objects with Marshal exposes implementation details.

While (2) means you cannot use it if you care about interoperability, (1) also applies when you're staying in Ruby. A redditor put it in a few words:

It's really infuriating when you (for example) can't send serialized ruby objects over the network because of a 0.0.1 version difference between the 2.
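To make the version issue concrete, here is a minimal sketch that reads the format version out of a Marshal stream and checks what Marshal.load does with a stream claiming a newer minor version. The helper names (marshal_version, loadable?) are mine, introduced only for illustration. The sketch relies on the first two bytes of the stream being the major and minor version.

```ruby
# Return the [major, minor] format version stored at the start of a Marshal stream.
def marshal_version(data)
  data.bytes.first(2)
end

# True if this Ruby's Marshal.load accepts the stream, false if it rejects it
# with a TypeError (which is what happens on an incompatible format version).
def loadable?(data)
  Marshal.load(data)
  true
rescue TypeError
  false
end

data = Marshal.dump("hello")
major, minor = marshal_version(data)
puts "stream version:  #{major}.#{minor}"
puts "runtime version: #{Marshal::MAJOR_VERSION}.#{Marshal::MINOR_VERSION}"

# Simulate data written by a peer with a newer format by bumping the minor byte.
newer = data.dup
newer.setbyte(1, minor + 1)
puts "newer stream loadable? #{loadable?(newer)}"
```

The payload is byte-for-byte the same in both streams; only the version header differs, and that alone is enough for the load to be refused.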

The basic problem with Marshal is that it serializes an object by saving the name of its class and all its instance variables (you can spot them in the generated string):

>> A = Struct.new(:name, :id, :email, :phones)
=> A
>> s = Marshal.dump(A.new("John Doe", 1234, "jdoe@example.com", ["555-4321", 1]))
=> "\004\bS:\006A\t:\tname\"\rJohn Doe:\aidi\002\322\004:\n
    email\"\025jdoe@example.com:\vphones[\a\"\r555-4321i\006"
>> s.size
=> 78

The first, obvious consequence is that you cannot change the name of the class, since Marshal.load needs the class named in the byte stream to exist. The second, no less apparent, is that you cannot rename the instance variables either. But it gets worse than that: you are exposing implementation details (which instance variables hold which data) in the serialized form, and you will hardly be able to modify them if you want to read old data. Worse still, you won't be able to change them at all if you want old clients to read new data. This can be addressed in an ad-hoc manner with #marshal_dump and #marshal_load, but that requires extra code and means the data can no longer be decoded without the matching #marshal_load: effectively, #marshal_dump and #marshal_load define a protocol.
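Here is a small sketch of both points. Contact and Person are hypothetical classes made up for this illustration, and removing the constant is just a quick way to simulate a later release in which the class was renamed.

```ruby
# Hypothetical class, used only for this illustration.
class Contact
  def initialize(name, id)
    @name = name
    @id = id
  end
end

blob = Marshal.dump(Contact.new("John Doe", 1234))
puts blob.include?("Contact")  # is the class name part of the stream?
puts blob.include?("@name")    # are the instance variable names in it too?

# Simulate a later release in which the class was renamed or removed.
Object.send(:remove_const, :Contact)
begin
  Marshal.load(blob)
rescue ArgumentError => e
  puts "cannot load old data: #{e.message}"
end

# With #marshal_dump/#marshal_load the class decides what is written,
# so the instance variables can be renamed without touching the stream.
class Person
  attr_reader :full_name, :id

  def initialize(full_name, id)
    @full_name = full_name
    @id = id
  end

  def marshal_dump
    [@full_name, @id]
  end

  def marshal_load(fields)
    @full_name, @id = fields
  end
end

wire = Marshal.dump(Person.new("John Doe", 1234))
puts wire.include?("@full_name")  # instance variable names are not written
puts Marshal.load(wire).full_name
```

Note that even with #marshal_dump the class name still ends up in the stream, and the meaning of the two-element array is known only to Person#marshal_load.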

Now that the magic word, protocol, has been uttered, it's time to see if Marshal does anything for us as far as protocol extensibility is concerned. As said above, if anything, Marshal makes interoperability harder: the encoding is not guaranteed to stay stable (in practice it is not expected to change often, but we can't know), and implementation details are leaked by default.

As it turns out, Marshal doesn't help with the sort of backward/forward compatible protocol extensions extprot allows either.

In that example, a color field is extended from a single grayscale value to a HSV color in a backward and forward compatible way. With Marshal, old data would be serialized as a plain Fixnum, and this is also what you would get back in an updated client, with no reference to the hue or saturation fields. Newer clients have the advantage of hindsight, so they can cope with that: you can hand-code a method that uses Marshal.load and then promotes the grayscale value to a full HSV color, for instance. Older clients have no such option: when exposed to newer data, Marshal.load will see a reference to a non-existent (say) HSVColor class and bomb. With perfect foresight, the grayscale value would have been serialized as a 1-element array, and all references to the grayscale value in the old code would take the first element. Almost by definition, however, protocol extensions cannot be foreseen, so this is not practical. Taking that approach to the limit, all instance variables should be serialized as arrays, just in case you want to extend them, and numeric indexes would be needed to access their values. Nasty!
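The asymmetry between new and old clients can be sketched as follows. HSVColor and load_color are hypothetical names, and promoting a grayscale value to hue 0 and saturation 0 is my assumption about what a sensible upgrade would look like. On current Rubies the old value is an Integer rather than a Fixnum.

```ruby
# Hypothetical class introduced by the newer release.
HSVColor = Struct.new(:hue, :saturation, :value)

# Hand-written upgrade path for newer clients: promote grayscale to HSV.
def load_color(blob)
  raw = Marshal.load(blob)
  raw.is_a?(Integer) ? HSVColor.new(0, 0, raw) : raw
end

old_blob = Marshal.dump(128)                        # written by an old client
new_blob = Marshal.dump(HSVColor.new(10, 20, 128))  # written by a new client

p load_color(old_blob)
p load_color(new_blob)

# An old client has no HSVColor class; simulate that by removing the constant.
Object.send(:remove_const, :HSVColor)
begin
  Marshal.load(new_blob)
rescue ArgumentError => e
  puts "old client fails: #{e.message}"
end
```

The new client needed extra code to read old data, and the old client cannot be fixed at all without shipping it a new release.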

extprot, on the other hand, is designed to address these issues. If the original protocol is

message grafff = { objects : [ (shape * byte) ] }

(a grafff holds a list of shape + grayscale value tuples), and you later choose to change it so that objects can be of different colors, you can do

message color = { value : byte; hue : opt_byte; saturation : opt_byte }
message grafff = { objects : [ (shape * color) ] }

and everything _just works_. Old clients can read new data, new clients can read old data. The former simply ignore the information they don't know about (hue and saturation), the latter use default values for the missing fields.
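To illustrate what "ignore what you don't know, default what is missing" means for the reader, here is a toy model in plain Ruby. It does not use extprot or its wire format. Representing a message as an array of fields in declaration order, the function names, and the nil defaults are all assumptions made for the sake of the example.

```ruby
# Toy model only: a color on the wire is either a bare Integer (old writers)
# or an Array of fields in declaration order (new writers).

# Old reader: knows only the grayscale value and ignores any extra fields.
def old_read_color(wire)
  wire.is_a?(Array) ? wire.first : wire
end

# New reader: fills in defaults for fields that old writers never sent.
def new_read_color(wire, default_hue: nil, default_saturation: nil)
  value, hue, saturation = Array(wire)
  { value: value,
    hue: hue || default_hue,
    saturation: saturation || default_saturation }
end

p old_read_color(128)             # old data, old reader
p old_read_color([128, 10, 20])   # new data, old reader
p new_read_color(128)             # old data, new reader
p new_read_color([128, 10, 20])   # new data, new reader
```

The difference from the Marshal case is that neither reader needs to know in advance how the message will be extended. The rule for handling shorter or longer messages is part of the encoding, not of each client's code.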