Even though extprot supports arbitrarily complex data types (with structures, lists, arrays, tuples and disjoint unions), the encoding is reasonably simple. A surprisingly efficient (relative to Marshal.load, more on that below) "universal decoder" takes under 100 lines of Ruby code. It is "universal" in the sense that it can decode any extprot message without access to the protocol definition, by virtue of the format being self-describing. It being binary is not scary anymore.

A simple protocol / serialization format

This example is an adaptation of the one found in the Protocol Buffers tutorial for Python. The main difference is that there is no need to define a field ID in extprot (fields are encoded positionally). extprot's abstract syntax is more succinct, too:

type optional 'a = Unset | Set 'a
type phone_type = Mobile | Home | Work

message person = {
  name : string;
  id : int;
  email : optional<string>;
  phones : [ (string * phone_type) ]
}

message address_book = { persons : [ person ] }

Let's create a person; I'll serialize it from OCaml, and deserialize it in Ruby, using the "universal decoder". Here's how I generate the record from the OCaml toplevel (REPL in OCaml parlance):

(The lines starting with "#" are my input, terminated by ";;"; the ones following them are the response by the REPL.)

# let p = { Person.name = "John Doe"; id = 1234; email = Set "jdoe@example.com";
            phones = ["555-4321", Phone_type.Home] };;
val p : Person.person =
  {Person.name = "John Doe"; Person.id = 1234;
   Person.email = Optional.Set "jdoe@example.com";
   Person.phones = [("555-4321", Phone_type.Home)]}
# Std.output_file ~filename:"person.dat" ~text:(Extprot.Conv.serialize Person.write_person p);;
- : unit = ()

This generates a 54-byte file with the serialized person:

00000000  01 34 04 03 08 4a 6f 68  6e 20 44 6f 65 00 a4 13  |.4...John Doe...|
00000010  01 13 01 03 10 6a 64 6f  65 40 65 78 61 6d 70 6c  |.....jdoe@exampl|
00000020  65 2e 63 6f 6d 05 0f 01  01 0c 02 03 08 35 35 35  |e.com........555|
00000030  2d 34 33 32 31 1a                                 |-4321.|
00000036

And here's the output by the universal decoder in Ruby:

T0 ["John Doe", 1234, T0 ["jdoe@example.com"], [T0 ["555-4321", E1]]]

(I have redefined the #inspect method in the classes representing tuples/structures and constants.)

Comparison to Marshal

The above structure, which took 54 bytes in the extprot serialization, weighs 145 bytes when serialized with Marshal.dump. The reason is that the latter includes the class and field names (which I argue is a bad thing to do, as it exposes implementation details) and doesn't use particularly efficient encodings of integers and other values.

In order to compare the relative speed of Marshal.load (written in C) and the naïve extprot decoder written in Ruby (suboptimal and not optimized, for clarity --- the code can be found below), I generated an address book with over 18000 random person records. Here's what I measured:

Update: added the times for the new, de-pessimized pure Ruby decoder in 80 lines, substantially faster

Serialization size (bytes) Deserialization time
Marshal.load(from string) (Ruby core method in C) 3527449 1.29s
Marshal.load(from IO) 3527449 1.65s
naïve extprot decoder, from IO (pure Ruby) 1859128 3.03s
de-pessimized extprot decoder, from IO (pure Ruby) 1859128 2.17s

Marshal.load being only a bit over twice as fast as the naïve decoder (in the most favorable case) 30% faster than a pure Ruby decoder was unexpected. Keep in mind that it's C code versus suboptimal Ruby code. A posteriori, I can think of a few possible reasons (memory allocation being slow, Marshal's rather inefficient encoding, etc.).

A universal decoder in under 100 lines of Ruby

First of all, let's create some classes for the complex data types, and a couple exceptions for errors found while decoding:

class HTuple < Array; attr_accessor :tag end
class Enum < Struct.new(:tag); def inspect; "E#{tag}" end end

class Tuple < Array
  attr_accessor :tag
  def inspect; "T#{tag} " + super end
end

class Assoc < Hash
  attr_accessor :tag
  def initialize(h,t)
    update!(h)
    @tag = t
  end
end

class ExtprotError < StandardError; end
class BadWireType < ExtprotError; end

Here, HTuple is used for lists/arrays, and Tuple for actual tuples and structures (since they are encoded the same way, you can promote a tuple to a structure in the protocol definition at any time without breaking compatibility).

In extprot's encoding, each value is preceded by a prefix, which indicates the (wire) type, the tag, and the size (either implicitly, if it's fixed, or explicitly, for variable length types). The prefix is but an integer (encoded as a variable length integer, see below), and the type and tag parts are extracted trivially:

module Codec
  @@types = [
    :vint, :tuple, :bits8, :bytes,
    :bits32, :htuple, :bits64_long, :assoc,
    :bits64_float, :invalid, :enum, :invalid,
    :invalid, :invalid, :invalid, :invalid
  ]

  def ll_type(n); @@types[n & 0xf] end
  def ll_tag(n); n >> 4 end

  module_function :ll_type, :ll_tag
end

For the sake of ease of presentation, I decompose the decoding of a value into two parts:

  1. low-level routines for reading the prefix and the basic types

  2. the actual decoding process, that constructs the value

Amongst the routines defined in (1), read_vint is the most important. It reads variable size integers, encoded in little-endian format and using the MSB of each byte to indicate whether the integer continues or not:

class Reader
  include Codec

  def read_vint
    b = read_byte
    return b if b < 128
    x = e = 0
    while b >= 128
      x += (b - 128) << e
      e += 7
      b = read_byte
    end
    x + (b << e)
  end
  alias_method :read_prefix, :read_vint

  def initialize(io); @io = io end
  def read_bytes(n); @io.read(n) end
  def read_byte; @io.readchar end
  def read_raw_bool; read_byte != 0 end
  def read_raw_i8; read_byte end
  def read_raw_rel_int; n = read_vint; (n >> 1) ^ -(n & 1) end
  def read_raw_i32; read_bytes(4).unpack("V")[0] end
  def read_raw_i64; read_bytes(8).unpack("q")[0] end
  def read_raw_float; read_bytes(8).unpack("E") end
  def read_raw_string; len = read_vint; read_bytes(len) end
end

The higher-level part builds the values (e.g., it builds the tuples and lists) by using the above routines:

class Decoder
  include Codec

  def initialize(reader); @rd = reader end

  def read_value
    prefix = @rd.read_prefix
    tag = ll_tag(prefix)
    case ll_type(prefix)
    when :tuple; t = read_htuple(prefix); r = Tuple.new(t); r.tag = t.tag; r
    when :htuple; read_htuple(prefix)
    when :assoc; read_assoc(prefix)
    when :vint; @rd.read_raw_rel_int
    when :bits8; @rd.read_byte
    when :bits32; @rd.read_raw_i32
    when :bits64_long; @rd.read_raw_i64
    when :bits64_float; @rd.read_raw_float
    when :enum; Enum.new(tag)
    when :bytes; @rd.read_raw_string
    when :invalid; raise BadWireType
    end
  end

  private

  def read_htuple(prefix)
    tag = ll_tag(prefix)
    _ = @rd.read_vint
    nelms = @rd.read_vint
    a = Array.new(nelms)
    nelms.times{|i| a[i] = read_value }
    r = HTuple.new(a)
    r.tag = tag
    r
  end

  def read_assoc(prefix)
    tag = ll_tag(prefix)
    _ = @rd.read_vint
    nelms = @rd.read_vint
    r = {}
    nelms.times do |i|
      k = read_value
      v = read_value
      r[k] = v
    end
    Assoc.new(r, tag)
  end
end

It doesn't get much simpler than that.

What's next

The "universal decoder" can be compared to a JSON decoder in this case (or a restricted Marshal.load that works for a specific family of types). It operates without access to the protocol definition, and handling the resulting value is relatively bothersome --- the same way traversing an XML tree would be (except with numerical constants instead of tag names).

This decoder doesn't handle the compatibility-preserving extensions extprot allows, meaning that the client code would have to deal with them --- again, the same way you might have to change your code if an XML schema changes.

The next step is to support such protocol extensions automagically, so that the code needs no change to interoperate with newer protocol versions, and to generate class hierarchies that mirror the type definitions in the protocol, allowing to manipulate deserialized values easily.