As I explained yesterday, the caml_modify write barrier needed by OCaml's incremental + generational GC is the main reason why immutable data can often be faster than mutable state. I've played a little with it and achieved a >50% speed boost in synthetic benchmarks, which could map to up to two-digit percentage gains in actual programs (in particular, it can make Array.map, init, make and the likes over 20% faster). The mutable state part from the synthetic business entity update benchmark becomes 32% faster.

How the write barrier works

caml_modify needs to record the references from values in the major heap to the minor heap. The code is straightforward:

#define Modify(fp, val) do{                                                 \
  value _old_ = *(fp);                                                      \
  *(fp) = (val);                                                            \
  if (Is_in_heap (fp)){                                                     \
    if (caml_gc_phase == Phase_mark) caml_darken (_old_, NULL);             \
    if (Is_block (val) && Is_young (val)                                    \
        && ! (Is_block (_old_) && Is_young (_old_))){                       \
      if (caml_ref_table.ptr >= caml_ref_table.limit){                      \
        CAMLassert (caml_ref_table.ptr == caml_ref_table.limit);            \
        caml_realloc_ref_table (&caml_ref_table);                           \
      }                                                                     \
      *caml_ref_table.ptr++ = (fp);                                         \
    }                                                                       \
  }                                                                         \
}while(0)

fp is the block slot where the (suspected) pointer is to be stored. caml_modify first checks if the destination is in the major heap with the Is_in_heap macro I'll come back to in a minute. Is_block(v) is true when v is not an immediate value. There's an interesting piece of logic that tries to avoid recording the same reference twice: if the previous reference _old_ held in fp was to a block in the minor heap, we know it's been already added to the ref table and there's no need to do it again. In the worst case, when updating N times the same slot with values in the minor and the major heap alternatingly, the reference will be recorded N/2 times --- it seems very unlikely this would happen in practice, though, as the ref table is cleared on each minor GC run.

The main problem with the above routine is that Is_in_heap is fairly slow. On 64-bit architectures it involves a lookup in a sparse hash table to see if the corresponding page is being managed by OCaml (on 32 bit, a two-level array is used). The reason why this check cannot just be implemented as a negated "is it in the minor heap" test (two pointer comparisons at most) is that it is possible to have blocks which belong neither to the minor heap nor to the major one --- iow., memory areas not managed by OCaml at all, and which are not to be traversed by the GC.

Speeding it up

When we're not in the Phase_mark GC phase, the order of the checks can be altered so the expensive Is_in_heap test is only performed when the value to be stored resides in the minor heap and the old one didn't before. This way, when the same slot is being updated several times, Is_in_heap will only be used once (with the exception explained above).

The resulting code, albeit quite ugly because of the code duplication, is considerably faster. I measure the gain with two trivial programs:

let () =
  let a = Array.make 1_000 0 in
    for i = 1 to 100000 do
      ignore (Array.map (fun i -> i) a)
    done

and

open Printf

let time f x =
  let t0 = Unix.gettimeofday () in
  let y = f x in
    Printf.printf "Needed %8.5fs\n" (Unix.gettimeofday () -. t0);
    y

let ntimes n f x = for i = 1 to n do f x done

type silly = A | B of int

let () =
  let f a =
    for i = 1 to Array.length a - 1 do
      a.(i) <- B 0;
      a.(i) <- B 0;
      a.(i) <- B 0;
      a.(i) <- B 0;
      a.(i) <- B 0;
      a.(i) <- B 0;
      a.(i) <- B 0;
      a.(i) <- B 0;
      a.(i) <- B 0;
      a.(i) <- B 0;
    done
  in
    time (ntimes 100 f) (Array.make 1_000_000 A)

The latter is just meant to illustrate the isolate the effect of the write barrier, while the former measures the impact in higher-order functions with some overhead. Note that the array updates in the second one do not involve any allocation, as B 0 is evaluated once when the module is initialized. A block (i.e. non-immediate) value is needed here so that the compiler generates a caml_modify call --- it would know there's no need to if we were using A.

The times are 8.3s (for loop) and 1.66s (Array.map) for an unpatched OCaml 3.11.0, and 5.3s and 1.4s with a modified runtime using this routine:

#define Modify(fp, val) do{                                                 \
  value _old_ = *(fp);                                                      \
  *(fp) = (val);                                                            \
  if (caml_gc_phase == Phase_mark) {                                        \
    if (Is_in_heap (fp)) {                                                  \
      caml_darken (_old_, NULL);                                            \
      if (Is_block (val) && Is_young (val)                                  \
          && ! (Is_block (_old_) && Is_young (_old_))){                     \
        if (caml_ref_table.ptr >= caml_ref_table.limit){                    \
          CAMLassert (caml_ref_table.ptr == caml_ref_table.limit);          \
          caml_realloc_ref_table (&caml_ref_table);                         \
        }                                                                   \
        *caml_ref_table.ptr++ = (fp);                                       \
      }                                                                     \
    }                                                                       \
  } else {                                                                  \
      if (Is_block (val) && Is_young (val)                                  \
          && ! (Is_block (_old_) && Is_young (_old_)) &&                    \
          Is_in_heap (fp) ){                                                \
        if (caml_ref_table.ptr >= caml_ref_table.limit){                    \
          CAMLassert (caml_ref_table.ptr == caml_ref_table.limit);          \
          caml_realloc_ref_table (&caml_ref_table);                         \
        }                                                                   \
        *caml_ref_table.ptr++ = (fp);                                       \
      }                                                                     \
  }                                                                         \
}while(0)

I tried to avoid the code duplication as follows, but this proved to be slightly slower (5.6s for the for loop, no difference for Array.map):

#define Modify(fp, val) do{                                                 \
  value in_heap = 0;                                                        \
  value _old_ = *(fp);                                                      \
  *(fp) = (val);                                                            \
  if (caml_gc_phase == Phase_mark) {                                        \
    if (Is_in_heap (fp)) {                                                  \
      caml_darken (_old_, NULL);                                            \
      in_heap = 1;                                                          \
      goto caml_modify_maybe_add_to_ref_table;                              \
    }                                                                       \
  } else {                                                                  \
      caml_modify_maybe_add_to_ref_table:                                   \
      if (Is_block (val) && Is_young (val)                                  \
          && ! (Is_block (_old_) && Is_young (_old_)) &&                    \
          (in_heap || Is_in_heap (fp)) ){                                   \
        if (caml_ref_table.ptr >= caml_ref_table.limit){                    \
          CAMLassert (caml_ref_table.ptr == caml_ref_table.limit);          \
          caml_realloc_ref_table (&caml_ref_table);                         \
        }                                                                   \
        *caml_ref_table.ptr++ = (fp);                                       \
      }                                                                     \
  }                                                                         \
}while(0)

Comments

  1. The change seems indeed to speed up thing. Just a side question (possible optimization), what about checking that there something to modify before anything ?

    Beginning should change to #define Modify(fp, val) do{ if ((fp) != (val)) { value _old_ =(fp); *(fp) = (val);

    The point is that reading and checking is cheaper that writing. Moreover, processor branch predictor could optimize the test if everything is ok.

    Sylvain Le Gall, 29 May 2009 at 20:57#
  2. Sorry the code is awful

    #define Modify(fp, val) do{  \
      if (*(fp) != (val))        \
      {                          \
        value _old_ = *(fp);     \
        *(fp) = (val);
    
    Sylvain Le Gall, 29 May 2009 at 21:04#
  3. Wow, this looks precisely like what I need to speed-up a (real) function which visits and updates many times a mutable AST. I'll test your patch and provide the results here in the next few days.

    Gabriel, 29 May 2009 at 22:24#
  4. Only 2% speedup :-(

    I guess something else is killing my performances...

    Gabriel, 30 May 2009 at 09:08#
  5. Sylvain: indeed, that should only add a couple instructions to the main path; the branch doesn't cost anything most of the time, as it will be predicted as not taken, and the ~15 cycle penalty when it is is more than compensated by all the time we save by not doing anything else.

    Gabriel: what does gprof say about caml_modify? This is not going to make much of a difference unless it takes a sizable amount of time.

    In some programs of mine caml_modify represents over 30% of the execution time, so making it 50% faster represents a 10% speed boost for the overall program.

    mfp, 30 May 2009 at 10:05#
  6. IIRC (I don't have the figure right now), most of the time is spent in the GC. I improved things a lot thanks to eta-expansion and avoiding of partial applications (so that the caml_curry* dive deeply into gprof), but it didn't save that much time although the number of minor words decreased by 50%. And then I noticed caml_modify in my profile, and discovered what it meant in your blog posts. So I'm allocating too much memory, and spending too much time in the GC, but I can't find out more things to cut down.

    I must confess I've got pathological user-input: a C AST, in CIL, from a program using curl. I've learnt to hate curl's insane amount of crappy macros, producing a single line of code of 15kb. But then again, such an unexpected input is a good stress case to find the weak points of my module. And the part generating so much garbage and taking so much time is precisely the one mutating the AST a lot, hence my expectations.

    Anyway, if I ever find a way to speed-up this beast, I will write an article and let you know.

    Gabriel, 30 May 2009 at 19:13#
  7. For the record, I found a way to dramatically speed-up my program: doing a first pass which strips out every "if(0){ ... }" statement. Macros produced a lot of these, and I spent most of my time walking through such dead-code (more precisely, allocating temporary data related to the walk-through, and the GCing it).

    Gabriel, 02 June 2009 at 09:30#
  8. Fantastic! Now you just need to speed up copying of immutable records. ;) Anyway, thank you, I'd recently been wondering about the real costs incurred by all the copying of immutable records I tend to do. Incidentally, does this change effect Hashtbl as well?

    _why, 04 June 2009 at 14:58#
  9. _why: IIRC it made Hashtbl updates 10% faster or so. Some Array functions (sort, and especially copy and init) were improved by 15-20%.

    mfp, 17 June 2009 at 08:44#
  10. what is the used of mutable state and where it be running?

    anonymous, 16 July 2009 at 08:22#