Monday, September 12, 2011

Atomic read-modify-write operation on multicore

Recently, while working on our Shotgun large scale sparse logistic regression code, I learned some cool programming trick from Aapo Kyrola, who in turn learned it from Prof. Guy Blelloch from CMU.

The problem arises when you have an array, and you want multiple cores to add values to the same array position concurrently. This of course may result in undetermined behavior of the needed precautions are not taken.

A nice way to solve this problem is the following. Define the following
scary assembler procedure:

bool CAS(long *ptr, long oldv, long newv) {
      unsigned char ret;
      /* Note that sete sets a 'byte' not the word */
      __asm__ __volatile__ (
                    "  lock\n"
                    "  cmpxchgq %2,%1\n"
                    "  sete %0\n"
                    : "=q" (ret), "=m" (*ptr)
                    : "r" (newv), "m" (*ptr), "a" (oldv)
                    : "memory");
      return ret;
The above procedure defines a read-modify-write lock on the array, and permits
only one thread at a time to write to the specific array value given in ptr.
The way to use this procedure is as follows:
void add(int idx, double fact) {
        volatile double prev;
        volatile double newval;
        volatile double oldval;
        do {
            prev = arr[idx];
            oldval = prev;
            newval = prev+fact;
        } while (!CAS(reinterpret_cast<long *>(&arr[idx]), *reinterpret_cast<volatile long *>(&prev), *reinterpret_cast<volatile long*>(&newval)));
And here is some more detailed explanation from Guy Blelloch:
The CAS instruction is one of the machine instructions on the x86 processors (the first function is just calling the instruction, which has name cmpxchgq). Probably the best book that describes its various applications is Herlihy and Shavit's book titled "The Art of Multiprocessor Programming". The general use is for implementing atomic read-modify-write operations. The idea is to read the value, make some modification to it (e.g. increment it) and then write it back if the value has not changed in the meantime. The CAS(ptr, a, b)
function conditionally writes a value b into ptr if the current value equals a.


  1. Hi Danny,

    This post is too timely (I was debating emailing you about this, but I didn't want to harass you ... but since you started it, the field is open :-)

    I was having a back and forth on R-devel to see what needs to be done to get around this particular inlining __asm__ in order to get shotgun/buckshot to compile on architectures other than x86_64.

    I was hit with this problem the first time I tried to compile the library because OS X, by default, will try to x-compile the library for both i386, and x86_64 (for now, anyway), and the i386 build was failing.

    I actually had a lot of assembly questions below, which I removed for brevity's sake and I'll substitute with:

    I "see" (google, SO) that gcc also has some atomic lock operations, eg:

    Do you (or Aapo) have any experience with them?

    Currently I'm just checking to see if we're compiling on x86_64, and if not having buckshot "fail gracefully" during runtime, but still let the compiler finish w/o error for the problematic architecture, like so:

    Ideally I'd like to replace the Rf_error(...) with portable (maybe slower) code that would compile on other archs ... I'll hunt around for such a solution, but if you (or Aapo) have any hints, I'd be much obliged.


  2. Thinking outloud:

    Maybe in non x86_64 archs, providing some "normal" c++ code to do the operation, but guarding it with `#pragma omp critical` might suffice, no? For example, inside the `inline bool CAS` function:

    #if __x86_64__
    // assembly code
    #pragma omp critical
    //non-spiffy C++ version

    Not sure if that's even right, or if it the pragma can work "so far away" (deep into) the for loop that's being ||-ized, or even it is correct, how badly it would degrade performance.

    Was just thinking of a temporary, easy out.


  3. Hi Steve,
    Here is the answer I got from our great Yucheng Low:
    The Locked compare and exchange is already provided as a GCC builtin.

    That function can be written exactly as

    inline bool CAS(long *ptr, long oldv, long newv) { return __sync_bool_compare_and_swap(ptr, oldv, newv); }

    Can you please try it out?