Even though extprot supports arbitrarily complex data types (with structures, lists, arrays, tuples and disjoint unions), the encoding is reasonably simple. A surprisingly efficient (relative to Marshal.load, more on that below) "universal decoder" takes under 100 lines of Ruby code. It is "universal" in the sense that it can decode any extprot message without access to the protocol definition, by virtue of the format being self-describing. It being binary is not scary anymore.
A simple protocol / serialization format
This example is an adaptation of the one found in the Protocol Buffers tutorial for Python. The main difference is that there is no need to define a field ID in extprot (fields are encoded positionally). extprot's abstract syntax is more succinct, too:
type optional 'a = Unset | Set 'a
type phone_type = Mobile | Home | Work
message person = {
name : string;
id : int;
email : optional<string>;
phones : [ (string * phone_type) ]
}
message address_book = { persons : [ person ] }
Let's create a person; I'll serialize it from OCaml, and deserialize it in Ruby, using the "universal decoder". Here's how I generate the record from the OCaml toplevel (REPL in OCaml parlance):
(The lines starting with "#" are my input, terminated by ";;"; the ones following them are the response by the REPL.)
# let p = { Person.name = "John Doe"; id = 1234; email = Set "jdoe@example.com";
phones = ["555-4321", Phone_type.Home] };;
val p : Person.person =
{Person.name = "John Doe"; Person.id = 1234;
Person.email = Optional.Set "jdoe@example.com";
Person.phones = [("555-4321", Phone_type.Home)]}
# Std.output_file ~filename:"person.dat" ~text:(Extprot.Conv.serialize Person.write_person p);;
- : unit = ()
This generates a 54-byte file with the serialized person:
00000000 01 34 04 03 08 4a 6f 68 6e 20 44 6f 65 00 a4 13 |.4...John Doe...| 00000010 01 13 01 03 10 6a 64 6f 65 40 65 78 61 6d 70 6c |.....jdoe@exampl| 00000020 65 2e 63 6f 6d 05 0f 01 01 0c 02 03 08 35 35 35 |e.com........555| 00000030 2d 34 33 32 31 1a |-4321.| 00000036
And here's the output by the universal decoder in Ruby:
T0 ["John Doe", 1234, T0 ["jdoe@example.com"], [T0 ["555-4321", E1]]]
(I have redefined the #inspect method in the classes representing tuples/structures and constants.)
Comparison to Marshal
The above structure, which took 54 bytes in the extprot serialization, weighs 145 bytes when serialized with Marshal.dump. The reason is that the latter includes the class and field names (which I argue is a bad thing to do, as it exposes implementation details) and doesn't use particularly efficient encodings of integers and other values.
In order to compare the relative speed of Marshal.load (written in C) and the naïve extprot decoder written in Ruby (suboptimal and not optimized, for clarity --- the code can be found below), I generated an address book with over 18000 random person records. Here's what I measured:
Update: added the times for the new, de-pessimized pure Ruby decoder in 80 lines, substantially faster
| Serialization size (bytes) | Deserialization time | |
|---|---|---|
| Marshal.load(from string) (Ruby core method in C) | 3527449 | 1.29s |
| Marshal.load(from IO) | 3527449 | 1.65s |
| naïve extprot decoder, from IO (pure Ruby) | 1859128 | 3.03s |
| de-pessimized extprot decoder, from IO (pure Ruby) | 1859128 | 2.17s |
Marshal.load being only a bit over twice as fast as the naïve decoder (in the most favorable case) 30% faster than a pure Ruby decoder was unexpected. Keep in mind that it's C code versus suboptimal Ruby code. A posteriori, I can think of a few possible reasons (memory allocation being slow, Marshal's rather inefficient encoding, etc.).
A universal decoder in under 100 lines of Ruby
First of all, let's create some classes for the complex data types, and a couple exceptions for errors found while decoding:
class HTuple < Array; attr_accessor :tag end
class Enum < Struct.new(:tag); def inspect; "E#{tag}" end end
class Tuple < Array
attr_accessor :tag
def inspect; "T#{tag} " + super end
end
class Assoc < Hash
attr_accessor :tag
def initialize(h,t)
update!(h)
@tag = t
end
end
class ExtprotError < StandardError; end
class BadWireType < ExtprotError; end
Here, HTuple is used for lists/arrays, and Tuple for actual tuples and structures (since they are encoded the same way, you can promote a tuple to a structure in the protocol definition at any time without breaking compatibility).
In extprot's encoding, each value is preceded by a prefix, which indicates the (wire) type, the tag, and the size (either implicitly, if it's fixed, or explicitly, for variable length types). The prefix is but an integer (encoded as a variable length integer, see below), and the type and tag parts are extracted trivially:
module Codec
@@types = [
:vint, :tuple, :bits8, :bytes,
:bits32, :htuple, :bits64_long, :assoc,
:bits64_float, :invalid, :enum, :invalid,
:invalid, :invalid, :invalid, :invalid
]
def ll_type(n); @@types[n & 0xf] end
def ll_tag(n); n >> 4 end
module_function :ll_type, :ll_tag
end
For the sake of ease of presentation, I decompose the decoding of a value into two parts:
low-level routines for reading the prefix and the basic types
the actual decoding process, that constructs the value
Amongst the routines defined in (1), read_vint is the most important. It reads variable size integers, encoded in little-endian format and using the MSB of each byte to indicate whether the integer continues or not:
class Reader
include Codec
def read_vint
b = read_byte
return b if b < 128
x = e = 0
while b >= 128
x += (b - 128) << e
e += 7
b = read_byte
end
x + (b << e)
end
alias_method :read_prefix, :read_vint
def initialize(io); @io = io end
def read_bytes(n); @io.read(n) end
def read_byte; @io.readchar end
def read_raw_bool; read_byte != 0 end
def read_raw_i8; read_byte end
def read_raw_rel_int; n = read_vint; (n >> 1) ^ -(n & 1) end
def read_raw_i32; read_bytes(4).unpack("V")[0] end
def read_raw_i64; read_bytes(8).unpack("q")[0] end
def read_raw_float; read_bytes(8).unpack("E") end
def read_raw_string; len = read_vint; read_bytes(len) end
end
The higher-level part builds the values (e.g., it builds the tuples and lists) by using the above routines:
class Decoder
include Codec
def initialize(reader); @rd = reader end
def read_value
prefix = @rd.read_prefix
tag = ll_tag(prefix)
case ll_type(prefix)
when :tuple; t = read_htuple(prefix); r = Tuple.new(t); r.tag = t.tag; r
when :htuple; read_htuple(prefix)
when :assoc; read_assoc(prefix)
when :vint; @rd.read_raw_rel_int
when :bits8; @rd.read_byte
when :bits32; @rd.read_raw_i32
when :bits64_long; @rd.read_raw_i64
when :bits64_float; @rd.read_raw_float
when :enum; Enum.new(tag)
when :bytes; @rd.read_raw_string
when :invalid; raise BadWireType
end
end
private
def read_htuple(prefix)
tag = ll_tag(prefix)
_ = @rd.read_vint
nelms = @rd.read_vint
a = Array.new(nelms)
nelms.times{|i| a[i] = read_value }
r = HTuple.new(a)
r.tag = tag
r
end
def read_assoc(prefix)
tag = ll_tag(prefix)
_ = @rd.read_vint
nelms = @rd.read_vint
r = {}
nelms.times do |i|
k = read_value
v = read_value
r[k] = v
end
Assoc.new(r, tag)
end
end
It doesn't get much simpler than that.
What's next
The "universal decoder" can be compared to a JSON decoder in this case (or a restricted Marshal.load that works for a specific family of types). It operates without access to the protocol definition, and handling the resulting value is relatively bothersome --- the same way traversing an XML tree would be (except with numerical constants instead of tag names).
This decoder doesn't handle the compatibility-preserving extensions extprot allows, meaning that the client code would have to deal with them --- again, the same way you might have to change your code if an XML schema changes.
The next step is to support such protocol extensions automagically, so that the code needs no change to interoperate with newer protocol versions, and to generate class hierarchies that mirror the type definitions in the protocol, allowing to manipulate deserialized values easily.