Wednesday, June 26, 2013

C#, Async Glue programming with CallReactor class

Our developers ask me: What is the "reactor"? Plainly: a reactor is a mechanism to breath-in execution flow into some object graph for asynchronous execution, usually a callback/event dispatcher. In other words - it is at least one thread of execution that REACTS to certain condition/change in data (such as property change or collection change). A classical use of reactors are event loops, i.e. the one used by the UI thread that monitors event queue and fires events (callbacks) appropriately. Reactors are ideal for multi-threaded programming when otherwise disjoint parties need to play in concert. The NFX.Glue.CallReactor class facilitates asynchronous call dispatching and provides callback mechanism upon result receipt from the server. Every instance has a dedicated thread that monitors all Call instances in the reactor. Here is how it works:
  • Call Async_* version of Glue client proxy call. It returns CallSlot instance
  • A call to Async_* method is wrapped in the new Call{} construct that remembers it in the reactor scope
  • Pass a lambda function that gets called back by the reactor
  • Optionally, the reactor .ctor takes a callback that is fired when all Calls registered with the reactor are done
The code below is taken from the WinForms-based manual test case,hence "Invoke(delegate)" to sync the callback with the UI thread.
var client1 = new MarazmContractClient(cbo.Text);
var client2 = new MarazmContractClient(cbo.Text);

    new CallReactor(false,
        finishedReactor => 
        {
            client1.Dispose(); 
            client2.Dispose();
            Invoke( (Action)( () =>
                       MessageBox.Show(finishedReactor.Context.ToString())
                        ) );
        },
        string.Empty,
        new Call( client1.Async_Echo("One"),
         (reactor, call) => reactor.Context =
           ((string)reactor.Context) + 
            call.CallSlot.GetValue()),
        new Call( client2.Async_Echo("Two"), 
         (reactor, call) => reactor.Context =
           ((string)reactor.Context) +
            call.CallSlot.GetValue()),
        new Call( client1.Async_Echo("Three"),
         (reactor, call) => reactor.Context = 
           ((string)reactor.Context) +
            call.CallSlot.GetValue())
    );

Wednesday, May 15, 2013

Parallel integer summation challenge Erlang vs CLR/C#

Today we have done an interesting benchmark in C# and Erlang. Basically the point was to compare the brevity and performance of two inherently different langs/approaches:

Erlang r16b:

$ cat t.erl

-module(t).
-export([t/0, t/1, ct/1, ct/2]).

-define(DEF_CNT, 10000000).

t() ->
    t(?DEF_CNT).
t(N) when is_integer(N) ->
    time_it(fun() -> test(N) end, N).

ct(Threads) ->
    ct(?DEF_CNT, Threads).
ct(N, Threads) when is_integer(N), is_integer(Threads) ->
    time_it(fun() -> ctest(N, Threads) end, N * Threads).

time_it(Fun, Divisor) ->
    {T, Result} = timer:tc(Fun),
    {T / (Divisor/1000000) / 1000, Result}.

ctest(N, M) ->
    Pids = spawner(self(), N, M),
    lists:foldl(fun(Pid, S) ->
        receive {Pid, Sum} -> S+Sum end
    end, 0, Pids).

spawner(_, _, 0) ->
    [];
spawner(Owner, N, M) ->
    [spawn(fun() -> loop(Owner, N) end) | spawner(Owner, N, M-1)].

loop(Pid, N) ->
    Sum = test(N),
    Pid ! {self(), Sum}.

test(N) -> test(N, 0).

test(0, M) -> M;
test(N, M) -> test(N-1, M+1).


C#, NET 4:

private void button4_Click(object sender, EventArgs e)
{
    const int CNT = 100000000;
    const int SPLIT = 1000000;

    var tasks = new Task[CNT / SPLIT];
       
         

    var w = Stopwatch.StartNew();
          
    for(int i=0,start=0; i    {
    tasks[i] = new Task( () =>
                                {
                                    long lsum = 0;
                                    long end = start + SPLIT;
                                    for(int c=start; c                                    return lsum;
                                }
                            );
    tasks[i].Start();
    }

    long sum = tasks.Sum( t => t.Result);
         
    w.Stop();

    var rate = CNT / (w.Elapsed.TotalSeconds);

    Text = string.Format("Total: {0} in {1} ms  at {2}/sec  {3} msec/million ",

  CNT,
 w.Elapsed.TotalMilliseconds, 
rate,
w.Elapsed.TotalMilliseconds / (CNT / 1000000)
);

    var s = sum;
    Text+= s.ToString();
}



Test results are really interesting:
  .NET      0.22 msec per million
  Erlang does this is 1.5 msec per million

CLR is 6.8 faster for this test.

Some other facts.  Linear loop single threaded that sums integers (not optimized out of code, I confirmed in disassembler)

PHP code:
 $CNT = 10000000;
 $sum = 0;
 $w = microtime(true);
 for($i=0; $i<$CNT; $i++) $sum += 1;
 $w = round((microtime(true)-$w)*1000,0);
 echo($w);


PHP standard runtime:  1040 ms
C#:                                     2ms

CLR is 500 times faster that PHP for this simple test




Tuesday, May 14, 2013

C# CLR JIT Optimization - Register allocation for primitives

Today I have come across a strange effect in C#

This code took 30 ms to run, and C++ version took 5-8msec. How come?

void Runtest()
{
   const int CNT = 10000000;//ten million
   long sum = 0;

   var w = Stopwatch.StartNew();

    for(int i=0; i < CNT, i++)
    {
       sum += 1;
    }
   w.Stop();

   var rate = CNT / (w.Elapsed.TotalSeconds);

   Text = string.Format("Total: {0} in {1} ms  at {2}/sec ", CNT, 
                         w.Elapsed.TotalMilliseconds, rate);

   Text+= sum.ToString();
}


The loop body does not do anything, and hopefully CLR JITs it into registers right?

If yes, why is it so many times slower than C++?

The answer lies with the last line:  "s.ToString();".  It treats "sum" as a struct, because structs are not really primitive types, although "long" fits in the register 1:1,
 the "objecty" nature of the "sum.ToString()" makes JIT generate memory-accessing code.I have confirmed this in WinDBG. The loop control var "i" is stored in the register but "sum" is not.

Solution:

      var s = sum;
   Text += s.ToString();


Result: 2 msec per 10,000,000 iterations instead of 30 ms. 15 times faster.


Here I have put an upper loop bound to Int.MaxValue for the code to run a bit longer so I can manage to ctrl+break in debugger.

Disassembly (loop body is highlighted):

000007ff`0014055d 33c0            xor     eax,eax         //i=0
000007ff`0014055f 33ed            xor     ebp,ebp         //sum=0
LOOP LABEL:
000007ff`00140561 48ffc5          inc     rbp               //sum++  
000007ff`00140564 ffc0            inc     eax               //i++  
000007ff`00140566 3dffffff7f      cmp     eax,7FFFFFFFh      //check int.Max 
000007ff`0014056b 7cf4            jl      000007ff`00140561 //jmp if less  LOOP LABEL

CAN not beat that!

The bottom line is this: JIT does optimize primitives in CPU registers just like C++ does, however because of unified type system it is easy to treat a primitive as struct or object - and that introduces penalty.
The truth of the matter is that until the loop exit they (Microsoft) could have kept it in register and then moved to RAM, but there are many edge cases like this and who knows what other issues JIT developers had to solve.