I have been working for a while on extprot, a tool that allows you to create compact, efficient and extensible binary protocols that can be used for cross-language communication and long-term data serialization. extprot supports protocols with rich, composable types, whose definition can evolve while keeping both forward and backward compatibility.
The protocols created using extprot are:
extensible: types can be extended in several ways without breaking compatibility with existent producers/consumers
self-delimited: each message indicates its own length. This allows you to send sequences of messages (streaming) without having to add message delimiters.
self-describing: a message can be decoded even without the protocol definition. What you get is roughly equivalent to XML without the DTD.
compact: 2 to >6 times less space than XML, typically 2 to 4 times less space than individual, compressed XML messages.
fast: can be deserialized one to two orders of magnitude faster than XML, and faster than it’d take to merely uncompress XML data.
The extprot compiler (extprotc) takes a protocol description and generates code in any of the supported languages to serialize and deserialize the associated data structures. It is accompanied by a runtime library for each target language which is used to read and write the structures defined by the protocol.
At this point, you'll be thinking, "what, yet another Protocol Buffers/Thrift/ASN.1 DER/XDR/IIOP/IFF?"... Not quite: extprot differentiates itself in that it allows for more extensions and supports richer types (mainly tuples and disjoint union types aka. sum types) than Protocol Buffers or Thrift without approaching the complexity of ASN.1 DER. (Note that XDR does not define self-describing protocols, making protocol changes hard at best.)
There are three parts to extprot
The improved modifiability and richer data types show in (2) and (3). (1) is similar to Protocol Buffer's, but differs in significant ways.
Example
Here's a trivial protocol definition:
(* this is a comment (* and this a nested comment *) *)
message user = {
id : int;
name : string;
}
The value
{ id = 1; name = "J.R.R. Tolkien" }
is serialized as (bytes in decimal notation plus ASCII characters between quotes) this 21-byte message:
001 019 002 000 002 003 014 "J.R.R Tolkien"
The code generated by extprotc allows you to manipulate such messages as any normal value. For instance, in the Ruby target (in progress as of 2009-01-14), you'd do:
# writing
puts "About to save record for user #{user.name}"
user.write(buf)
# save buf
# reading
user = User.read(io)
puts "Got user #{user.id} #{user.name}"
In OCaml, the message is simply a record:
let u = User.io_read_user stream in printf "User %S has got id %d\n" u.name u.id
Data types and allowed extensions
extprot supports an assortment of primitive types (bool, byte, float, string, and signed int and long) as well as composed types (lists, arrays, tuples and disjoint unions):
type color = Red | Black (* a disjoint union *)
type coords = (float * float * float) (* a tuple *)
message sphere = { coords : coords; color : color }
message triangle = { vertices : [| coords |] } (* array of tuples *)
type age = Unknown | Known int (* if known, age in years *)
message user = { name : string; age : age }
Possible changes to a protocol while keeping backward (and often forward) compatibility include:
adding fields to a message
adding elements to a tuple
adding cases to a disjoint union
promoting a primitive type
xto a composed type whose first element is of typexconverting a tuple into a message (i.e. giving a name to the elements of the tuple) and vice-versa
Implementation status
The extprotc compiler is written in OCaml, allowing new targets to be added without undue effort (ML excels at the symbolic processing performed by compilers and code generators).
As of 2009-01-15, the OCaml target is usable and the Ruby one is in progress. The code can be obtained from extprot's git repository.
Comments
What about bin_prot? It's OCaml and supports sum types!
It's also seen use for lightweight protocol versioning.
Fred, bin_prot indeed supports pretty much all OCaml data types (including recursive types and polymorphic variants), is fast and convenient since all you have to do is add
with bin_ioto the type definition to have the marshallers defined for you. It presents some problems regarding interoperability and extensibility, though:the encoding is neither self-delimited nor self-describing, so you absolutely need the original type definition to deserialize data; there's no way to interpret the byte stream otherwise.
the set of supported types is too rich. Polymorphic variants, for instance don't map to (any) other languages.
Jane Street's blog entry on lightweight protocols suggests that all nodes be able to communicate with different protocol versions (using semi-manual conversion functions) and that the protocol be negotiated when communication is established. This solution is not serviceable when you are serializing data to disk because you cannot go back in time to ask whoever generated it to use a different protocol :).
Here's a simple example of a safe protocol extension that keeps both forward and backward compatibility, trivially supported by extprot but impossible with bin_prot: suppose we have this record
message user = { name : string }We later decide that a user can have an email, so the protocol is extended:
type email = No_email | Some_email string message user = { name : string; email : email }What will happen when a new client reads old data is that the
emailfield will be set toNo_email. An old client reading new data will simply ignore the extra field.This sort of extensions is well supported by extprot, Protocol Buffers or Thrift, but impossible with bin_prot. (In fact, it's the only extension mechanism in Protocol Buffers and Thrift besides a couple low-level and ad-hoc ones, but only one of the several supported by extprot.)
PS: this could also be written using a reusable polymorphic type:
type maybe 'a = Unset | Set 'a (* this is a polymorphic type *) message user = { name : string; email : maybe<string> }I mostly agree with you about bin_prot, but you didn't mention it in your post so it wasn't clear if you had considered it. The NIH downside of rolling your own is quite small, considering the non-interoperable nature of bin_prot and its lack of use outside Jane Street. There are also Ocaml datatypes it doesn't support: bignums for example.
The conversion stuff is not quite as bad as you think, as the reader does the conversion, and I think the protos are version-tagged. I'm pretty sure Jane Street uses it extensively for long-term storage.
On the other hand, in conversation with Yaron Minsky he blanched when I used the 'pattern' epithet to describe their versioning scheme. It's so ad-hoc that it doesn't even have a name!
You should probably put a note in your template mentioning the markup language that comments will be parsed as :)
This is very cool. Now the challenge is to extend it to C/C++!
What luck! I had been using an awful little marshalling scheme in Potion to store bytecode. I dropped my study of protocol buffers because it lacked self-description and hadn't found anything else promising -- other than swiping another language's marshalling format.
This looks exceptional, though, and fits Potion's ideals very nicely. And I'm tickled that I get to thieve some more Fernandez-styled inventions. A beloved pasttime of mine.
Fred, I think that version tagging (and message delimitation, FWIW) isn't done within the bin_prot protocol, so you have to wrap bin_prot messages manually. (The messages use a custom markup language based on simplified Markdown, where __ delimits underlined text; \ can be used to escape any character, so bin\_prot turns into bin_prot.)
Dan, I've done some preliminary work on this. The first step is to hand-generate the code for a small example (first the type declarations, then the writer and finally the reader).
_why, long time :) I see you have a small serialization format in
core/compile.c, with some versioning info (PNBHeader) and composed types (WRITE_TUPLE) that could indeed benefit fromextprot's encoding.Even though I have begun some work on the C target, it will take a while to complete. That said, the encoding can also be used in a hand-written serializer. In other words, if you're going to do one by hand, you might as well use extprot's format, which shouldn't take much more effort if the data types are not too complex --- the hard part is to make the thing work for arbitrarily complex types.
You're right, using extprot for all marshalling would be a bit more work, but it would also be pretty nice, since I'd imagine extprot is great for random access, which is unusual in a serialization language. I will be okay without any formal bindings, since I'll only be using a few of the built-in types.
Since all values are self-delimited (either because they are of fixed size or because their length is included in their header), you can decode only the parts of a message you're interested in. For instance, if you want to access the last field of a structure, you can quickly skip all the preceding ones. This differs starkly from Thrift, whose default encoding uses a terminator to signal the end of a structure, forcing you to go through unwanted fields.
This doesn't mean you can access an arbitrary field in O(1) time, though, as it's still O(n), but with a potentially much better constant (in particular, if you're skipping a few large values, instead of many smaller ones). It is possible to "decorate" a data definition to support O(1) random access, however. Say we have
message foo = { field1 : int; field2 : string; ... fieldN : whatever }and we want O(1) access to any field. We decorate the type this way
message foo_with_offsets = { offsets : [| long |]; foo : foo }i.e., we simply add the field offsets as the first element. Since the offsets are of fixed size (64 bits here), the offset for the n-th element can be obtained in constant time (instead of having to decode integers of variable size, which would make it O(n) again).
A hand-written decoder sounds easy to build if you don't implement all the "safe extension" rules in case it only handles a specific schema. I've written a generic decoder as a Ruby extension in C, and in takes ~200 lines overall, with only <100 LoC for the main loop.
I wonder if there is any benefit to adding a method/option to 'shuffle' the encoded data, and groups of the data, as in HDF5: http://www.hdfgroup.org/HDF5/doc_resource/H5Shuffle_Perf.pdf
This would probably be of interest in only some use cases/message formats, but could offer a lot of benefit in cases where network volume matters and compression is worth the cost.
Mark
Mark, AFAICS shuffling applies mainly to numerical data, which HDF is optimized for. At the moment, there's no special provision in extprot for large, homogeneous (numeric) arrays: it's thus relatively inefficient at that, as Protocol Buffers or Thrift. Of course, it's always possible to pack numeric data into a string (and you can specify a conversion function used to expand it automatically when the value is read; this is implemented in the OCaml target), using an arbitrary shuffling scheme for the sake of better compression.
Fortunately, there are still several unallocated wire types in the low-level encoding, so support for homogeneous arrays can be added in time if people start using extprot for numerical data.
Sorry I should have been much clearer, I had in mind the use case where several identically formatted messages are 'collated/collected' then encoded, then shuffled. Apologies if the following example is not 'correct' - I use Ruby but have been impressed by rocaml's performance, I just haven't got to use it yet :(
In some use cases the message data may easily be int. The example I had in mind was: Server sends client a mapping message:
{1= "hello"; 2 = "world"; 3 = "wide"; 0 = "NULL"}Then then server sends the data:
{ first_word = 1; second_word = 2; third_word=0 } { first_word = 1; second_word = 3; third_word=2 } { first_word = 1; second_word = 2; third_word=3 } { first_word = 1; second_word = 1; third_word=1 }Maybe this wouldn't benefit too much from a shuffle which is then compressed?
I must firstly apply the disclaimer that I'm passing over this in my first pass morning e-mail + rss scan, so I have yet to read the code.
I've been using Marshal, YAML, and JSON in ruby for a few years now, which are relatively lightweight formats when used in the manner I've been using them. (Simple object schemes, built of hashes, arrays, and common 'native' types).
Performance is more than acceptable for a lot of distributed work (ruby style stuff), and the implementations are very simplistic (recently experimental API style here: http://github.com/raggi/object_protocol/tree/master).
I had a look at protocol buffers, but the thing that seems to concern me with most of these protocol generators is the wire format. YAML, JSON and Marshal are relatively lightweight in terms of wire protocols, and can also be relatively effectively compressed for larger streams, and also carry a reasonably light overhead for smaller packets.
There are various things not covered by the implementations at this point which would be desirable, such as channelling, but these features can be easily encoded into a single channel protocol on the next layer up.
Something like the sized_header_protocol i'm presently using (mainly because it matches ruby's own drb wire format) introduces a few limits and wastes a byte or two for many more trivial uses. Covering these edge cases in any manner adds significantly to the runtime, though. Practicality wise, I've yet to meet anyone who can honestly tell me they need to send 4Gb packets at that layer, though.
Disregarding specific implementation limitations, validation, and other nice-to-haves, what I always find annoying is the lack of real justification for any particular choice of protocol. I have tried very hard to come up with some reasonable ideas as to what actually makes for a good wire format, or what makes for a good protocol, but the practice has yet to lead to a strong conclusion in any direction. Even very verbose protocols can be sufficiently performant that bandwidth becomes a concern before any serialisation / deserialisation strategy expires CPU capabilities, excepting maybe REXML from the ruby camp.
My liking for simpler formats like YAML/JSON are simply that the construct for the protocol is so trivial that the contents of packets are actually readable. The wastage in the transport is merely the unused header space (in common apps), and the unused character space (in many apps). I spend a few minutes running down the route of thinking that should be sorted out, or some idea of having a sideband channel that uses these 'spare bits', when all of a sudden I lose transport features, like readability, portability, and implementation complexity. I start to feel like I'm implementing a tightly squeezed protocol on top of some dodgy old CDMA, or something equally horrific.
The popular protocols to date seem to be those which have a relatively readable wire format, for better or worse. I'm interested to hear more opinions on the matter, as it seems designing protocol wire formats and transport APIs is not as common as it once was, and the current tendency to use 'common' serializers seems to be pretty arbitrary.
The idea of a standardized normalization format may be more attractive than an actual wire format. That is, lets examine XML for a moment, horror that it may be.
Provides many unused validation tools
Rigorous document structure
Readable during transport
Consistent
Unframed
To make XML in any way efficient, various compression methodologies can be used. I've done work in the past with companies that run custom compressors over arbitrary protocols, and they had some neat shared dictionary tools for XML based and embeded protocols. It turns out that something horrifically explicit like XML is very compressible - and this is where the normalization thing comes in.
In order for a protocol to be 'readable' it must come out of our sniffing / dumping tools in some format that we can approach, and get to the data. At the protocol level (forgetting for a second, application specific data), designers care about overhead. This is where XML really starts to fall down with it's verbose nature.
Consider the following document extract:
Assuming two communicating systems adhere to a particular normalisation standard, and the general 'discussion' (communications channel), carries a loosely typed 'context' (say, the "tig" namespace in the above example), a simple normalisation can be utilised to massively reduce overhead. Potential for output can be reduced to as little as (or quite a few bits less than):
The requirement for any normalisation strategy is schema knowledge, or consistent full-duplex algorithms.
Commonly I hear the claim that schema knowledge requires pre-shared information, however, I do not believe this is the case. Compression schemes regularly do this without such meta data being available. All that is required in reality is a consistent behaviour and prediction system. For really extreme normalisation, some dynamic exchange and overhead may be helpful (such as to achieve the example above), however, gains over a format like XML can be made in the third packet, even for very small packets and short(ish) name spaces. When to use normalisation strategies can be a hard decision also. Normalisation is most effective as the node count or node metadata become very large. Dynamically selecting some 'mode' of normalisation is entirely possible, but absolutely more complex, especially to read off the wire. There can be other benefits however, such as gaining hardening against statistical attacks on encrypted channels, and so on.
For the record, object_protocol sitting on top of sized_header_protocol, although lacking protocol version, and implementation metadata, have the following on-the-wire sizes for the simple :id => 1, :name => 'J.R.R. Tolkien' example:
So, 1.5 to 2 times the size of extprot at this level (low node count). Similarly the readability for YAML and JSON is far above and beyond what you can expect from more binary protocols, and simple ZLib compression can take out quite a bit of overhead in many trivial cases.
Anyway, just food for thought / a quick brain dump. I'll be keeping an eye on this, it looks interesting :-)
Mark: I don't see how shuffling would help in that particular example. It seems rather that a caching convention like that suggested by Joe Armstrong is at play here. Even though the encoding could be modified to support it natively, I'd wait to see if such a thing is actually being used as an end-to-end mechanism (meaning that extprot doesn't directly know about it) before making extprot any more complex.
raggi: wow, I'll need some time to digest this :)
mfp: Fair comment. I had previously seen Joe's caching suggestion, and it was this which made me think of a shuffle-encoding (hidden from the user). I took Joe's comment to be aimed at the '*_word' labels, which could be mapped backed, still leaving the matrix 'data-of-interest' to be shuffled - per hdf5. As you correctly point out this may well be too narro an edge case. Two I can think of though are the classical 'stock-price-stream', but more realistically: production logging applications where a log-dictionary may be expected to be defined and also where wire volume and speed are real issues (you can never have too much log info :)), cf. Facebook's description of reasons for their architecture choices/decisions behind scribe server. Seems to me that this use-case is where extprot can have further impact. :) Love it anyway and look forward to using it. Thanks for all the effort.