HPCA - Basic Code Transformations

Loop Identification and Relocation
==================================

Method 1: Relocating into private functions

Each speculative loop is relocated into a new function (with an
automatically-generated name like main_spec_0, main_spec_1, foo_spec_2,
etc.) created just to encapsulate the loop.  While the system is executing
non-speculative code, the "slave" cpus just sit in a master-slave barrier,
waiting for speculation to start -- this can be initiated by adding in a few
lines of code to the program's main function.  When the "master" CPU reaches
a speculative loop, it sends the function pointer of the loop encapsulation
function to the slaves and then calls the loop encapsulation function
itself.  By doing this, the 4 CPUs will all start this function more or less
simultaneously and work on the loop in concert.  When finished, the "slave"
CPUs will go back to their barriers, while the master will continue
executing the non-speculative portions of the program.

Here's how a typical "loop function" would look:

void foo_spec_2(void)
{
. . . << set up privatized stuff for each cpu >> . . .
spec_begin(&&spec_start, &&spec_start, &&spec_terminate, &&spec_terminate_last);
do
{
        spec_endofiteration();
    spec_begin:

        . . . << loop code >> . . . 
        . . . << spec_terminate(); or spec_endofiteration(); may be included
                 where break; or continue; would normally be >> . . .

} while (1);
spec_terminate_last:
    . . . << un-privatize stuff from last iteration >> . . .
spec_terminate:
    . . . << finish cleaning up, with all 4 CPUs >> . . .
}

A BIG note: ALL of the spec_ routines DON'T act like normal functions . . .
they "pop out" at the spec_ labels.  spec_begin and spec_endofiteration pop
out at the spec_begin label, and spec_terminate pops out at the
spec_terminate_last label.  Restarted iterations restart at spec_begin (so
it looks like the iteration being restarted was coming right out of
spec_begin or the last spec_endofiteration, as far as the loop is
concerned), while "killed" loop iterations after a spec_terminate call by
another CPU pop out at spec_terminate.  This results in somewhat funny
program control flow, but it's necessary to make things work quickly and
efficiently.

And here's how the location of the original loop would be modified:

. . . 

if (alreadySpeculating)
{
    << original loop code >>
}
else
{
    slavePointer = foo_spec_2;
    MSlaveBarrier_Release(slaveBarrier);
    foo_spec_2();
    MSlaveBarrier_Wait(slaveBarrier);
}

. . . 

The two variables used are:

    slavePointer = a global "function" pointer that all slaves get upon
        starting speculation
    slaveBarrier = a global struct that contains the barrier's flags


Method 2: Inline loop modification

A problem with this system is that all stack-based variables must be
globalized or copied to/from the other CPUs before running the loop, since
the loop will have originally been out in the middle of an otherwise normal
C function, with a fair amount of data context that might need to be carried
over to the new function.  A possible way to sidestep this would be to NOT
break the program into more functions, but to instead have the slaves vector
right in to the correct location of the original function, like this:

. . .

if (alreadySpeculating)
{
    << original loop code >>
}
else
{
    slavePointer = &&foo_spec_2;

    << force register save, copy stack to slaves >>

    MSlaveBarrier_Release(slaveBarrier);
foo_spec_2: 

    << force register restore >>
    << original loop modified as above >>
  spec_terminate_last:
    << force register save, copy stack back to master >>
  spec_terminate:

    if (pid == 0)
    {
        MSlaveBarrier_Wait(slaveBarrier);
        << force register restore >>
    }
    else
        MSlaveBarrier_SlaveEnter(slaveBarrier);
}

The two variables used are identical to those for method 1.  Also, a
structure would be needed for each processor to save and restore registers.

This solution would be nicer in some ways -- no additional function call, no
fiddling with "globalizing" variables to other functions, etc. -- but ickier
in others -- those forced register and stack saves & restores.  While it
*would* be possible to run with only one stack, and have all of the
processors share it, this would probably result in an unacceptable number of
false-sharing conflicts in the cache.  Since most of the false dependencies
would be WAW, WAW protection bits added to the L1 cache might help some, but
I'm not sure if they would completely solve the problem.



Transforming C Loops
====================

Here are the transformations that our source-to-source translator would need
to perform.  All are fairly mechanical.

1. FOR loops:

Straightforward counted loops with fixed-stride variable increments and
should be modified like the following example.  More complex FOR loops
should be converted into equivalent WHILE loops and modified like #2.

for (i=0; i < 10; i++)
{
    << LOOP CORE >>
}

becomes

i = spec_procid();
spec_begin(&&spec_start, &&spec_start, &&spec_terminate, &&spec_terminate_last);
do
{
        spec_endofiteration();
    spec_begin:

        if (i >= 10) spec_terminate;

        << LOOP CORE >>

        i++;
} while(1);
spec_terminate_last:
spec_terminate:


2. WHILE loops:

while ((c=getchar()) != -1)
{
    << LOOP CORE >>
}

becomes

spec_begin(&&spec_start, &&spec_start, &&spec_terminate, &&spec_terminate_last);
do
{
        spec_endofiteration();
    spec_begin:

        if ((c=getchar()) == -1) spec_terminate;

        << LOOP CORE >>

} while(1);
spec_terminate_last:
spec_terminate:


3. DO loops:

do
{
    << LOOP CORE >>
}
while ((c=getchar()) != -1)

becomes

spec_begin(&&spec_start, &&spec_start, &&spec_terminate, &&spec_terminate_last);
do
{
        spec_endofiteration();
    spec_begin:

        << LOOP CORE >>

} while((c=getchar()) != -1);
    spec_terminate;
spec_terminate_last:
spec_terminate:



Explicit Synchronization
========================

It's also possible to do explicit synchronization in my model using LWC2, a
load that isn't put into the speculative buffers so that it will not cause
violations.  A simple source-to-source modifier probably wouldn't be able to
do this, but I'll leave it in this description anyway.  Here's an example of
synchronization that protects code so each iteration will have to wait its
turn to execute a certain region of code:

    do
        asm volatile("lwc2 %0,%1" : "=r" (__spec_sync) : "o" (__sync_n));
    while (__spec_sync != __iteration);

    << PROTECTED CODE >>

    __sync_n += 1;

The three variables used are:

    __iteration = the iteration number of this speculative region
    __spec_sync = a private temporary variable
    __sync_n = a global variable, one per << PROTECTED >> region, that contains
        the number of the iteration that should be allowed to access the
        region at this time.  It must be reset to 0 before the speculative
        loop in order to allow this code to work.



That's pretty much how all of my speculation stuff currently works.  You can
look at /usr/people/lance/less/wc/wc.out_c.c and
/usr/people/lance/less/compress/compress95.out_c.c on bigdipper to get a
feel for how SUIF uses these primitives (in a separate-function mode).