I have optimized a bit the universal pure-Ruby extprot deserializer, and written a new one as a C extension. The former now approaches the speed of Marshal.load in my simple tests (quite remarkable because this is Ruby vs. C) and is now down to 70 lines of code. The extension is over 3 times faster than Marshal.load and takes a bit over 200 lines.
| Serialization size (bytes) | Deserialization time | |
|---|---|---|
| Marshal.load(from string) (Ruby core method in C) | 3527449 | 1.29s |
| Marshal.load(from IO) | 3527449 | 1.65s |
| pure Ruby extprot decoder, from IO | 1859128 | 1.85s |
| Ruby extprot decoder (extension), from IO | 1859128 | 0.48s |
Optimizing dispatching in Ruby
The core of the decoder is a loop that reads the prefix of the next value and then calls the appropriate subroutine to decode it. In the original Ruby decoder, this looked like
def read_value
prefix = read_prefix
tag = ll_tag(prefix)
case ll_type(prefix)
when :tuple; t = read_htuple(prefix); r = Tuple.new(t); r.tag = t.tag; r
when :htuple; read_htuple(prefix)
when :assoc; read_assoc(prefix)
when :vint; n = read_vint; (n >> 1) ^ -(n & 1)
when :bits8; @io.readchar
when :bits32; @io.read(4).unpack("V")[0]
when :bits64_long; @io.read(8).unpack("q")[0]
when :bits64_float; @io.read(8).unpack("E")
when :enum; Enum.new(tag)
when :bytes; len = read_vint; @io.read(len)
when :invalid; raise BadWireType
end
end
While the wire type can be obtained in constant time (just have to index an array), the case expression requires linear time to find the appropriate branch, since it will compare to the possible values (:tuple, :htuple, etc.) in order (moreover, this involves an extra method call, #===, per comparison). Essentially, what I'd like to do is to create an array of Procs and call the appropriate one. This is unfortunately very slow, so instead I make use of Ruby's native dispatching (i.e., method dispatching) this way:
class Vint < Base; def self.read(_, io); n = read_vint(io); (n >> 1) ^ -(n & 1) end end class Bits8 < Base; def self.read(_, io); io.readchar end end ... LL_TYPES = [ Vint, Tuple, Bits8, Bytes, Bits32, HTuple, Bits64_long, Assoc, Bits64_float, Invalid, Enum, Invalid, Invalid, Invalid, Invalid, Invalid, ] def self.read_value(io) prefix = read_prefix(io) LL_TYPES[prefix & 0xf].read(prefix, io) end
The extension
Here's the gist of the C extension, some 100 lines that could easily adapted for another target:
static unsigned char *
do_read_value(unsigned char *ptr, unsigned char *end, VALUE *dst)
{
unsigned int prefix, n, tag, len, nelms, wtype;
#define Read_vint_check(dst) \
do { ptr = read_vint(ptr, end, dst); if(!ptr) return NULL; } while(0);
if(!ptr || ptr >= end) return NULL;
ptr = read_vint(ptr, end, &prefix);
if(!ptr) return NULL;
tag = prefix >> 4;
wtype = prefix & 0xF;
switch(wtype) {
case VINT:
Read_vint_check(&n);
*dst = INT2NUM((int)n >> 1 ^ -(n & 1));
break;
case BITS8:
*dst = INT2FIX(*ptr++);
break;
case BITS32:
if(ptr + 4 > end) return NULL;
/* FIXME: endianness */
*dst = INT2NUM(*(int32_t *) ptr);
ptr += 4;
break;
case BITS64_LONG:
if(ptr + 8 > end) return NULL;
{
/* FIXME: endianness */
VALUE l, h;
l = INT2NUM(*(int32_t *) ptr);
h = INT2NUM(*(int32_t *) (ptr + 4));
*dst = rb_funcall(l, id_or, 1,
rb_funcall(h, id_shl, 1, INT2FIX(32)));
ptr += 8;
}
break;
case BITS64_FLOAT:
if(ptr + 8 > end) return NULL;
/* FIXME: endianness */
*dst = rb_float_new(*(double *)ptr);
ptr += 8;
break;
case ENUM:
*dst = rb_obj_alloc(rb_Enum_c);
rb_iv_set(*dst, "@tag", INT2FIX(tag));
break;
case TUPLE:
case HTUPLE:
Read_vint_check(&len);
if(bound_error(ptr, end, len)) return NULL;
end = ptr + len;
Read_vint_check(&nelms);
*dst = rb_ary_new2(nelms);
rb_iv_set(*dst, "@tag", INT2FIX(tag));
RBASIC(*dst)->klass = (wtype == TUPLE) ? rb_Tuple_c : rb_HTuple_c;
for(n = 0; ptr && (n < nelms); n++) {
/* we enlarge the array before reading the val because the GC
* might be triggered */
RARRAY(*dst)->len++;
RARRAY(*dst)->ptr[n] = Qnil;
ptr = do_read_value(ptr, end, RARRAY(*dst)->ptr + n);
if(!ptr) return NULL;
};
break;
case BYTES:
Read_vint_check(&len);
if(bound_error(ptr, end, len)) return NULL;
*dst = rb_str_new(ptr, len);
ptr += len;
break;
case ASSOC:
Read_vint_check(&len);
if(bound_error(ptr, end, len)) return NULL;
end = ptr + len;
Read_vint_check(&nelms);
*dst = rb_hash_new();
rb_set_iv(*dst, "tag", INT2FIX(tag));
RBASIC(*dst)->klass = rb_Assoc_c;
for(n = 0; ptr && n < nelms; n++) {
VALUE k, v;
ptr = do_read_value(ptr, end, &k);
if(!ptr) return NULL;
ptr = do_read_value(ptr, end, &v);
if(!ptr) return NULL;
rb_hash_aset(*dst, k, v);
};
break;
default:
rb_raise(rb_eRuntimeError, "Unknown wire type.");
}
return ptr;
#undef Read_vint_check
}
Comments
Any idea why your array of procs idea was slow?
fREW, it's just that calling a
Procis slow. The interpreter needs an eternity to invoke a closure:require 'benchmark' TIMES = 2_000_000 p = lambda{||} def foo; end Benchmark.bm(10) do |bm| bm.report("meth"){ TIMES.times{ foo; foo; foo; foo; foo } } bm.report("Proc#call"){ TIMES.times{ p[]; p[]; p[]; p[]; p[] } } end # >> user system total real # >> meth 1.610000 0.000000 1.610000 ( 1.632122) # >> Proc#call 4.110000 0.000000 4.110000 ( 4.125201)Could these optimizations also be applied to Ruby/Thrift?
Evan, I think so, as it is using naïve
O(n)dispatching atm.:def read_type(type) case type when Types::BOOL read_bool when Types::BYTE read_byte when Types::DOUBLE read_double ...I can see several other possible bottlenecks, the most apparent one being that a string is allocated whenever a value is read, e.g., whenever it reads a
bool(= reading one byte from the input stream), a 1-byte string is created. Since Ruby usesmallocto allocate the buffer, this is slow and bound to lead to heap fragmentation.Forgot to add that string allocation should not be as problematic in Ruby 1.9 since it uses a trick to pack small strings (up to 11 bytes on x86, 23 on x64) into the object slot, without an external buffer. It's still slower than needed, though, and could cause the object table to grow in size (which again can lead to heap fragmentation).