Wednesday, May 15, 2013

Parallel integer summation challenge Erlang vs CLR/C#

Today we have done an interesting benchmark in C# and Erlang. Basically the point was to compare the brevity and performance of two inherently different langs/approaches:

Erlang r16b:

$ cat t.erl

-export([t/0, t/1, ct/1, ct/2]).

-define(DEF_CNT, 10000000).

t() ->
t(N) when is_integer(N) ->
    time_it(fun() -> test(N) end, N).

ct(Threads) ->
    ct(?DEF_CNT, Threads).
ct(N, Threads) when is_integer(N), is_integer(Threads) ->
    time_it(fun() -> ctest(N, Threads) end, N * Threads).

time_it(Fun, Divisor) ->
    {T, Result} = timer:tc(Fun),
    {T / (Divisor/1000000) / 1000, Result}.

ctest(N, M) ->
    Pids = spawner(self(), N, M),
    lists:foldl(fun(Pid, S) ->
        receive {Pid, Sum} -> S+Sum end
    end, 0, Pids).

spawner(_, _, 0) ->
spawner(Owner, N, M) ->
    [spawn(fun() -> loop(Owner, N) end) | spawner(Owner, N, M-1)].

loop(Pid, N) ->
    Sum = test(N),
    Pid ! {self(), Sum}.

test(N) -> test(N, 0).

test(0, M) -> M;
test(N, M) -> test(N-1, M+1).

C#, NET 4:

private void button4_Click(object sender, EventArgs e)
    const int CNT = 100000000;
    const int SPLIT = 1000000;

    var tasks = new Task[CNT / SPLIT];

    var w = Stopwatch.StartNew();
    for(int i=0,start=0; i    {
    tasks[i] = new Task( () =>
                                    long lsum = 0;
                                    long end = start + SPLIT;
                                    for(int c=start; c                                    return lsum;

    long sum = tasks.Sum( t => t.Result);

    var rate = CNT / (w.Elapsed.TotalSeconds);

    Text = string.Format("Total: {0} in {1} ms  at {2}/sec  {3} msec/million ",

w.Elapsed.TotalMilliseconds / (CNT / 1000000)

    var s = sum;
    Text+= s.ToString();

Test results are really interesting:
  .NET      0.22 msec per million
  Erlang does this is 1.5 msec per million

CLR is 6.8 faster for this test.

Some other facts.  Linear loop single threaded that sums integers (not optimized out of code, I confirmed in disassembler)

PHP code:
 $CNT = 10000000;
 $sum = 0;
 $w = microtime(true);
 for($i=0; $i<$CNT; $i++) $sum += 1;
 $w = round((microtime(true)-$w)*1000,0);

PHP standard runtime:  1040 ms
C#:                                     2ms

CLR is 500 times faster that PHP for this simple test

Tuesday, May 14, 2013

C# CLR JIT Optimization - Register allocation for primitives

Today I have come across a strange effect in C#

This code took 30 ms to run, and C++ version took 5-8msec. How come?

void Runtest()
   const int CNT = 10000000;//ten million
   long sum = 0;

   var w = Stopwatch.StartNew();

    for(int i=0; i < CNT, i++)
       sum += 1;

   var rate = CNT / (w.Elapsed.TotalSeconds);

   Text = string.Format("Total: {0} in {1} ms  at {2}/sec ", CNT, 
                         w.Elapsed.TotalMilliseconds, rate);

   Text+= sum.ToString();

The loop body does not do anything, and hopefully CLR JITs it into registers right?

If yes, why is it so many times slower than C++?

The answer lies with the last line:  "s.ToString();".  It treats "sum" as a struct, because structs are not really primitive types, although "long" fits in the register 1:1,
 the "objecty" nature of the "sum.ToString()" makes JIT generate memory-accessing code.I have confirmed this in WinDBG. The loop control var "i" is stored in the register but "sum" is not.


      var s = sum;
   Text += s.ToString();

Result: 2 msec per 10,000,000 iterations instead of 30 ms. 15 times faster.

Here I have put an upper loop bound to Int.MaxValue for the code to run a bit longer so I can manage to ctrl+break in debugger.

Disassembly (loop body is highlighted):

000007ff`0014055d 33c0            xor     eax,eax         //i=0
000007ff`0014055f 33ed            xor     ebp,ebp         //sum=0
000007ff`00140561 48ffc5          inc     rbp               //sum++  
000007ff`00140564 ffc0            inc     eax               //i++  
000007ff`00140566 3dffffff7f      cmp     eax,7FFFFFFFh      //check int.Max 
000007ff`0014056b 7cf4            jl      000007ff`00140561 //jmp if less  LOOP LABEL

CAN not beat that!

The bottom line is this: JIT does optimize primitives in CPU registers just like C++ does, however because of unified type system it is easy to treat a primitive as struct or object - and that introduces penalty.
The truth of the matter is that until the loop exit they (Microsoft) could have kept it in register and then moved to RAM, but there are many edge cases like this and who knows what other issues JIT developers had to solve.