eigenclass logo
MAIN  Index  Search  Changes  PageRank  Login

Legitimate uses of micro-benchmarks: parameter passing and function call costs

Update [2007-12-03] Simon Marlow has annotated the Haskell assembly (thank you!); see his comment below.

So, everybody knows about the Fibonacci pissing contest by now. From the moment you mention two programming languages, this is likely to happen, and the blog post that triggered all this provided a small push sufficient to make the discussion degenerate that way*1. The chosen micro-benchmark also played a very negative role, as it gives too many opportunities to get sidetracked: better algorithms (recursion vs. iteration, memoization...), overflowable ints vs. arbitrary precision integers, etc.

You're probably expecting me to throw some code at you (OCaml since I've been talking about it a bit as of late) in spite of everything I wrote above and give enough figures to be able to claim something like "Holy Shmoly, OCaml smokes {Haskell, SBCL, C++} away". (That's truly a blogging classic, deploring the silliness of the latest blog trend and adding to it the second after.)

Instead, I'm going to use this micro-benchmark the way it was meant to: not to compare language implementations in general, but to measure an isolated characteristic. Micro-benchmarks are exceedingly useful for language implementors and for those who want to assess the cost of some particular operation.

What I want to do is build some intuition about the cost of function calls and argument passing in Haskell. I shall look at the generated assembly code in order to understand where my CPU cycles are going (this is where I'll be needing your help, because, as you will see, Haskell's asm isn't exactly easy to read).

Before I plunge into Haskell, I'll take a look at the code generated by GCC and OCaml, which will serve as the control group of sorts and help me verify that what I think I know about them is correct. If I can't understand what's going on with C and OCaml, I have no chance with Haskell. (Yes, I will also give you execution times, but I promise there will be no Holy Shmoly.)

Some familiar code first

This is what I expect in my first runs:

  1. argument passing should be faster in OCaml than in C with a non-static function
  2. GCC should be able to generate code about as fast (or faster) given enough inlining (OCaml never inlines recursive functions, recent GCC versions do)
  3. when using a static function (so that arguments are passed in registers instead of via the stack as required by the standard ABI), GCC's code should be as fast or faster than OCaml's

You can jump to the next section if you're not interested in the details of the code generated by GCC and OCaml. In few words, my intuition regarding C and OCaml was basically correct, and I now want to see what causes the Haskell figures in this table (this is where I'll need some help):

implementationexecution time(s)
GCC -O20.444
GCC -O30.405
OCaml0.413
Haskell (Int)1.03

Here goes the OCaml code and the corresponding assembly, which can hardly get more readable:

let rec fib = function
    0 | 1 as x -> x
  | n -> fib (n-1) + fib(n-2)

let () =
  for i = 0 to 35 do
    Printf.printf "n=%d => %d\n" i (fib i)
  done

camlCamlfib__fib_58:
        subl    $8, %esp
.L101:
        cmpl    $3, %eax
        jbe     .L100
        movl    %eax, 0(%esp)
        addl    $-4, %eax
        call    camlCamlfib__fib_58
.L102:
        movl    %eax, 4(%esp)
        movl    0(%esp), %eax
        addl    $-2, %eax
        call    camlCamlfib__fib_58
.L103:
        movl    4(%esp), %ebx
        addl    %ebx, %eax
        decl    %eax
        addl    $8, %esp
        ret
        .align  16
.L100:
        addl    $8, %esp
        ret

Both fib's argument and its return value are passed via %eax. The stack is used to hold n and fib(n-2). Note that OCaml uses the least significant bit to tag integers (so $3 stands for 1, and addl $-4, %eax computes n-2) and performs right-to-left evaluation.

This is the comparable C code:

#include <stdio.h>

int fib(int n)
{
  if(n == 0 || n == 1) return n;
  return fib(n-1) + fib(n-2);
}
 
int main(int argc, char *argv[])
{
 int i;
 for(i = 0; i < 36; i++)
     printf("n=%d => %d\n", i, fib(i));
}

The assembly generated with -O2:

fib:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %esi
        xorl    %esi, %esi
        pushl   %ebx
        subl    $4, %esp
        movl    8(%ebp), %ebx
        cmpl    $1, %ebx
        jbe     .L5
        .p2align 4,,7
.L6:
        leal    -1(%ebx), %eax
        subl    $2, %ebx
        movl    %eax, (%esp)
        call    fib
        addl    %eax, %esi
        cmpl    $1, %ebx
        ja      .L6
.L5:
        leal    (%ebx,%esi), %eax
        addl    $4, %esp
        popl    %ebx
        popl    %esi
        popl    %ebp
        ret

I notice two things:

  • as expected, there's a fair amount of stack operations
  • a bit surprisingly, GCC is using a loop (executed 2 times) to compute first fib(n-1) and then fib(n-2) I misread that completely, GCC is being really clever here. See psykotic's comment

Again as expected, this is slower than OCaml's code. Results that confirm our conjectures always feel good (there's danger in that, though); better keep them in mind when we get negative or, worse, non-conclusive ones.

With -O3, the fib function takes over 200 instructions, as GCC expand it up to 4(?) levels deep (fib-O3.s). In this case inlining is a net gain, but it's not always the case since such code explosion could blow the instruction cache in a larger program.

Function calls and argument passing in Haskell

Here's the Haskell code I timed:

import Control.Monad (forM_)
import Text.Printf (printf)
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
main = forM_ [0..35] $ \i ->
            printf "n=%d => %d\n" i (fib i)

The fib function is annotated as Int -> Int, so that native integers are used and it can be meaningfully compared to the GCC and OCaml solutions.

The Haskell code is about 2.5 times slower than C/OCaml when compiled with

ghc -O2 -optc-O2 --make hsfib.hs

(-O3 is slightly slower for some reason). Here's where I turn to the assembly, expecting to see some thunks, Haskell idiosyncrasies regarding register usage and argument passing, something. Unfortunately, I have a very hard time reading the assembly (hsfib.s). It starts like this:

r1xp_info:
.text
        .type   r1xp_info, @function
        leal    -8(%ebp), %eax
        movl    %ebp, %edx
        cmpl    84(%ebx), %eax
        jb      .L14
        movl    (%ebp), %eax
        cmpl    $0, %eax
        jle     .L16
        cmpl    $1, %eax
        je      .L23
.L18:
        movl    %eax, (%edx)
        subl    $1, %eax
        movl    %eax, -8(%ebp)
        movl    $s1xG_info, -4(%ebp)
        subl    $8, %ebp
        jmp r1xp_info
        .p2align 4,,7
.L14:
        movl    $r1xp_closure, %esi
        jmp     *-4(%ebx)
.L23:
        movl    $1, %esi
        leal    4(%ebp), %ebp
        movl    (%ebp), %eax
        jmp     *%eax

I can see something that looks like closures (but there's no allocation on the heap, it all happens in the stack) and some tail calls. The control flow is really hard to follow, so it's hard to get much information out of this. Arguments seem to be passed through the stack to r1xp_info, but other subroutines(?) like s1xG_info use both registers and the stack.

Does anybody have some clues as to how to read such assembly code, or is it not supposed to be inspected, even if it's just to build some intuition about abstraction costs and the performance of Haskell programs?



Try giving Haskell some strictness annotations - ephemient (2007-11-30 (Fri) 06:23:37)

The strictness analyzer in GHC's optimizer correctly judges that fib is strict, so it's not generating thunks for fib(n-1) and fib(n-2). Otherwise this would be even harder to read.

I'm on PowerPC Linux, so the assembly I see is very different from yours. For me, though, I have a marker named "Main_zdwfib_entry", and it actually looks similar to the x86 OCaml assembly you have.

Check to see if your GHC is using -fvia-C or -fasm; whether or not -O implies this depends on your architecture and GHC version. It looks like PowerPC uses -fvia-C, but I believe that x86 uses GHC's native code generator.

Jon Harrop 2007-11-30 (Fri) 06:39:39

Yeah, on my 4400+ Athlon64 running x86-64 Debian I get 0.315s for OCaml and 0.726s for Haskell with GHC 6.8 (with -O3 because it is slightly faster here).

However, you've used an or-pattern in the OCaml and not in the Haskell. Writing the OCaml the same way brings the time up to 0.411s. Hmm, Haskell doesn't seem to support or-patterns yet. Shame. Back to OCaml. :-)

zzy 2007-12-05 (Wed) 15:29:11

come

Name:
Comment:
TextFormattingRules
Preview:

Yeah, last title is wrong - ephemient (2007-11-30 (Fri) 06:31:18)

My first thought was "strictness", but then I checked with -ddump-simpl and saw that GHC gets it right. Then I forgot to fix the title.

Also, I looked at this on x86. Wow, it's a lot stranger than PowerPC assembly. I honestly have no idea what's going on... but I'm not a GHC expert by any means.

mfp 2007-11-30 (Fri) 07:19:07

It seems the above assembly was generated with -fvia-C (this is GHC 6.6). With -fasm, I get even more mystifying assembly, with no noticeable differences in speed. PowerPC vs. x86: scarcity of registers and convoluted instruction set?

Name:
Comment:
TextFormattingRules
Preview:

misreading of GCC assembly - psykotic (2007-11-30 (Fri) 09:52:16)

Your interpretation of what GCC's generated assembly is doing is pretty far off. I wanted to point out how by posting a commented version of the assembly code, but I wasn't sure if your blog's text formatting was up to the task, so I posted it at reddit instead:

http://programming.reddit.com/info/61tc0/comments/c02kcpp

mfp 2007-11-30 (Fri) 10:41:33

That's how I read it at first (one recursion being turned into a tail recursion, mainly because there one only one call fib and then the loop), but it seemed way too clever and I then somehow saw a movl $3, %ebx where there was none without caring to read the code completely, so I came up with that wrong interpretation.

PS: there's an ajaxy "preview" button

Name:
Comment:
TextFormattingRules
Preview:

Haskell assembly annotated - Sion Marlow (2007-12-03 (Mon) 02:22:18)

r1xp_info:
.text
       .type   r1xp_info, @function

This is a stack check. %ebp is the (Haskell) stack pointer, and 84(%ebx) is where the stack limit for this thread is stored:

       leal    -8(%ebp), %eax
       movl    %ebp, %edx
       cmpl    84(%ebx), %eax
       jb      .L14

the argument to fib is stored on the stack, because (sadly) GHC doesn't use any registers for argument passing on x86 right now. It does on x86_64 and powerpc, though, and will on x86 in the future. So load the argument off the stack:

       movl    (%ebp), %eax

comparisons against 0 and 1. GHC has actually done a pretty bad job of compiling these two comparisons, it is using a decision tree unnecessarily. I must look into that! (note that if you write fib assuming non-negative numbers, you can get away with just one comparison).

       cmpl    $0, %eax
       jle     .L16
       cmpl    $1, %eax
       je      .L23
.L18:

save n for the future call to fib (n-2):

       movl    %eax, (%edx)

store (n-1) on the stack for the next call to fib:

       subl    $1, %eax
       movl    %eax, -8(%ebp)

push the continuation:

       movl    $s1xG_info, -4(%ebp)
       subl    $8, %ebp

and recurse...

       jmp r1xp_info
       .p2align 4,,7
.L14:
       movl    $r1xp_closure, %esi
       jmp     *-4(%ebx)
.L23:
       movl    $1, %esi
       leal    4(%ebp), %ebp
       movl    (%ebp), %eax
       jmp     *%eax
Name:
Comment:
TextFormattingRules
Preview:

wo - zzy (2007-12-05 (Wed) 15:29:24)

come

Name:
Comment:
TextFormattingRules
Preview:

Use this form to create a new top-level comment; for direct replies to existing comments, use the text entries you'll find below.

Name:
Subject:

TextFormattingRules
Preview:
Last modified:2007/11/30 05:29:50
Keyword(s):[blog] [frontpage] [c] [ocaml] [haskell] [micro] [benchmark] [assembly]
References:

*1 I really believe things could have been much more interesting hadn't its title been 'Holy Shmoly, Ruby 1.9 smokes Python away!'