I have optimized a bit the universal pure-Ruby extprot deserializer, and written a new one as a C extension. The former now approaches the speed of Marshal.load in my simple tests (quite remarkable because this is Ruby vs. C) and is now down to 70 lines of code. The extension is over 3 times faster than Marshal.load and takes a bit over 200 lines.

Serialization size (bytes) Deserialization time
Marshal.load(from string) (Ruby core method in C) 3527449 1.29s
Marshal.load(from IO) 3527449 1.65s
pure Ruby extprot decoder, from IO 1859128 1.85s
Ruby extprot decoder (extension), from IO 1859128 0.48s

Optimizing dispatching in Ruby

The core of the decoder is a loop that reads the prefix of the next value and then calls the appropriate subroutine to decode it. In the original Ruby decoder, this looked like

def read_value
  prefix = read_prefix
  tag = ll_tag(prefix)
  case ll_type(prefix)
  when :tuple; t = read_htuple(prefix); r = Tuple.new(t); r.tag = t.tag; r
  when :htuple; read_htuple(prefix)
  when :assoc; read_assoc(prefix)
  when :vint; n = read_vint; (n >> 1) ^ -(n & 1)
  when :bits8; @io.readchar
  when :bits32; @io.read(4).unpack("V")[0]
  when :bits64_long; @io.read(8).unpack("q")[0]
  when :bits64_float; @io.read(8).unpack("E")
  when :enum; Enum.new(tag)
  when :bytes; len = read_vint; @io.read(len)
  when :invalid; raise BadWireType
  end
end

While the wire type can be obtained in constant time (just have to index an array), the case expression requires linear time to find the appropriate branch, since it will compare to the possible values (:tuple, :htuple, etc.) in order (moreover, this involves an extra method call, #===, per comparison). Essentially, what I'd like to do is to create an array of Procs and call the appropriate one. This is unfortunately very slow, so instead I make use of Ruby's native dispatching (i.e., method dispatching) this way:

class Vint < Base; def self.read(_, io); n = read_vint(io); (n >> 1) ^ -(n & 1) end end
class Bits8 < Base; def self.read(_, io); io.readchar end end
...

LL_TYPES = [
  Vint, Tuple, Bits8, Bytes, Bits32, HTuple, Bits64_long, Assoc,
  Bits64_float, Invalid, Enum, Invalid, Invalid, Invalid, Invalid, Invalid,
]

def self.read_value(io)
  prefix = read_prefix(io)
  LL_TYPES[prefix & 0xf].read(prefix, io)
end

The extension

Here's the gist of the C extension, some 100 lines that could easily adapted for another target:

static unsigned char *
do_read_value(unsigned char *ptr, unsigned char *end, VALUE *dst)
{
 unsigned int prefix, n, tag, len, nelms, wtype;

#define Read_vint_check(dst)  \
   do { ptr = read_vint(ptr, end, dst); if(!ptr) return NULL; } while(0);

 if(!ptr || ptr >= end) return NULL;
 ptr = read_vint(ptr, end, &prefix);
 if(!ptr) return NULL;
 tag = prefix >> 4;
 wtype = prefix & 0xF;
 switch(wtype) {
     case VINT:
         Read_vint_check(&n);
         *dst = INT2NUM((int)n >> 1 ^ -(n & 1));
         break;
     case BITS8:
         *dst = INT2FIX(*ptr++);
         break;
     case BITS32:
         if(ptr + 4 > end) return NULL;
         /* FIXME: endianness */
         *dst = INT2NUM(*(int32_t *) ptr);
         ptr += 4;
         break;
     case BITS64_LONG:
         if(ptr + 8 > end) return NULL;
         {
             /* FIXME: endianness */
             VALUE l, h;
             l = INT2NUM(*(int32_t *) ptr);
             h = INT2NUM(*(int32_t *) (ptr + 4));
             *dst = rb_funcall(l, id_or, 1,
                               rb_funcall(h, id_shl, 1, INT2FIX(32)));
             ptr += 8;
         }
         break;
     case BITS64_FLOAT:
         if(ptr + 8 > end) return NULL;
         /* FIXME: endianness */
         *dst = rb_float_new(*(double *)ptr);
         ptr += 8;
         break;
     case ENUM:
         *dst = rb_obj_alloc(rb_Enum_c);
         rb_iv_set(*dst, "@tag", INT2FIX(tag));
         break;
     case TUPLE:
     case HTUPLE:
         Read_vint_check(&len);
         if(bound_error(ptr, end, len)) return NULL;
         end = ptr + len;
         Read_vint_check(&nelms);
         *dst = rb_ary_new2(nelms);
         rb_iv_set(*dst, "@tag", INT2FIX(tag));
         RBASIC(*dst)->klass = (wtype == TUPLE) ? rb_Tuple_c : rb_HTuple_c;
         for(n = 0; ptr && (n < nelms); n++) {
             /* we enlarge the array before reading the val because the GC
              * might be triggered */
             RARRAY(*dst)->len++;
             RARRAY(*dst)->ptr[n] = Qnil;
             ptr = do_read_value(ptr, end, RARRAY(*dst)->ptr + n);
             if(!ptr) return NULL;
         };
         break;
     case BYTES:
         Read_vint_check(&len);
         if(bound_error(ptr, end, len)) return NULL;
         *dst = rb_str_new(ptr, len);
         ptr += len;
         break;
     case ASSOC:
         Read_vint_check(&len);
         if(bound_error(ptr, end, len)) return NULL;
         end = ptr + len;
         Read_vint_check(&nelms);
         *dst = rb_hash_new();
         rb_set_iv(*dst, "tag", INT2FIX(tag));
         RBASIC(*dst)->klass = rb_Assoc_c;
         for(n = 0; ptr && n < nelms; n++) {
             VALUE k, v;
             ptr = do_read_value(ptr, end, &k);
             if(!ptr) return NULL;
             ptr = do_read_value(ptr, end, &v);
             if(!ptr) return NULL;
             rb_hash_aset(*dst, k, v);
         };
         break;
     default:
         rb_raise(rb_eRuntimeError, "Unknown wire type.");
 }

 return ptr;
#undef Read_vint_check
}


Comments

  1. Any idea why your array of procs idea was slow?

    fREW, 29 January 2009 at 16:53#
  2. fREW, it's just that calling a Proc is slow. The interpreter needs an eternity to invoke a closure:

    require 'benchmark'
    
    TIMES = 2_000_000
    
    p = lambda{||}
    def foo; end
    Benchmark.bm(10) do |bm|
      bm.report("meth"){ TIMES.times{ foo; foo; foo; foo; foo } }
      bm.report("Proc#call"){ TIMES.times{ p[]; p[]; p[]; p[]; p[] } }
    end
    
    # >>                 user     system      total        real
    # >> meth        1.610000   0.000000   1.610000 (  1.632122)
    # >> Proc#call   4.110000   0.000000   4.110000 (  4.125201)
    
    mfp, 29 January 2009 at 19:16#
  3. Could these optimizations also be applied to Ruby/Thrift?

    Evan, 02 February 2009 at 05:23#
  4. Evan, I think so, as it is using naïve O(n) dispatching atm.:

    def read_type(type)
      case type
      when Types::BOOL
        read_bool
      when Types::BYTE
        read_byte
      when Types::DOUBLE
        read_double
      ...
    

    I can see several other possible bottlenecks, the most apparent one being that a string is allocated whenever a value is read, e.g., whenever it reads a bool (= reading one byte from the input stream), a 1-byte string is created. Since Ruby uses malloc to allocate the buffer, this is slow and bound to lead to heap fragmentation.

    mfp, 02 February 2009 at 10:59#
  5. Forgot to add that string allocation should not be as problematic in Ruby 1.9 since it uses a trick to pack small strings (up to 11 bytes on x86, 23 on x64) into the object slot, without an external buffer. It's still slower than needed, though, and could cause the object table to grow in size (which again can lead to heap fragmentation).

    mfp, 02 February 2009 at 11:45#