c - A tested implementation of Peterson Lock algorithm?

Question

Welcome To Ask or Share your Answers For Others

c - A tested implementation of Peterson Lock algorithm?

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:30:42+0000

Peterson's algorithm cannot be implemented correctly in C99, as explained in who ordered memory fences on an x86.

Peterson's algorithm is as follows:

LOCK:

interested[id] = 1                interested[other] = 1
turn = other                      turn = id

while turn == other               while turn == id
  and interested[other] == 1        and interested[id] == 1


UNLOCK:

interested[id] = 0                interested[other] = 0

There are some hidden assumptions here. To begin with, each thread must note its interest in acquiring the lock before giving away its turn. Giving away the turn must make visible to the other thread that we are interested in acquiring the lock.

Also, as in every lock, memory accesses in the critical section cannot be hoisted past the lock() call, nor sunk past the unlock(). I.e.: lock() must have at least acquire semantics, and unlock() must have at least release semantics.

In C11, the simplest way to achieve this would be to use a sequentially consistent memory order, which makes the code run as if it were a simple interleaving of threads running in program order (WARNING: totally untested code, but it's similar to an example in Dmitriy V'jukov's Relacy Race Detector):

lock(int id)
{
    atomic_store(&interested[id], 1);
    atomic_store(&turn, 1 - id);

    while (atomic_load(&turn) == 1 - id
           && atomic_load(&interested[1 - id]) == 1);
}

unlock(int id)
{
    atomic_store(&interested[id], 0);
}

This ensures that the compiler doesn't make optimizations that break the algorithm (by hoisting/sinking loads/stores across atomic operations), and emits the appropriate CPU instructions to ensure the CPU also doesn't break the algorithm. The default memory model for C11/C++11 atomic operations that don't explicitely select a memory model is the sequentially consistent memory model.

C11/C++11 also support weaker memory models, allowing as much optimization as possible. The following is a translation to C11 of the translation to C++11 by Anthony Williams of an algorithm originally by Dmitriy V'jukov in the syntax of his own Relacy Race Detector [petersons_lock_with_C++0x_atomics] [the-inscrutable-c-memory-model]. If this algorithm is incorrect it's my fault (WARNING: also untested code, but based on good code from Dmitriy V'jukov and Anthony Williams):

lock(int id)
{
    atomic_store_explicit(&interested[id], 1, memory_order_relaxed);
    atomic_exchange_explicit(&turn, 1 - id, memory_order_acq_rel);

    while (atomic_load_explicit(&interested[1 - id], memory_order_acquire) == 1
           && atomic_load_explicit(&turn, memory_order_relaxed) == 1 - id);
}

unlock(int id)
{
    atomic_store_explicit(&interested[id], 0, memory_order_release);
}

Notice the exchange with acquire and release semantics. An exchange is an atomic RMW operation. Atomic RMW operations always read the last value stored before the write in the RMW operation. Also, an acquire on an atomic object that reads a write from a release on that same atomic object (or any later write on that object from the thread that performed the release or any later write from any atomic RMW operation) creates a synchronizes-with relation between the release and the acquire.

So, this operation is a synchronization point between the threads, there is always a synchronizes-with relationship between the exchange in one thread and the last exchange performed by any thread (or the initialization of turn, for the very first exchange).

So we have a sequenced-before relationship between the store to interested[id] and the exchange from/to turn, a synchronizes-with relationship between two consecutive exchanges from/to turn, and a sequenced-before relationship between the exchange from/to turn and the load of interested[1 - id]. This amounts to a happens-before relationship between accesses to interested[x] in different threads, with turn providing the synchronization between threads. This forces all the ordering needed to make the algorithm work.

So how were these things done before C11? It involved using compiler and CPU-specific magic. As an example, let's see the pretty strongly-ordered x86. IIRC, all x86 loads have acquire semantics, and all stores have release semantics (save non-temporal moves, in SSE, used precisely to achive higher performance at the cost of ocassionally needing to issue CPU fences to achieve coherence between CPUs). But this is not enough for Peterson's algorithm, as Bartosz Milewsky explains at who-ordered-memory-fences-on-an-x86 , for Peterson's algorithm to work we need to establish an ordering between accesses to turn and interested, failing to do that may result in seeing loads from interested[1 - id] before writes to interested[id], which is a bad thing.

So a way to do that in GCC/x86 would be (WARNING: although I tested something similar to the following, actually a modified version of the code at wrong-implementation-of-petersons-algorithm , testing is nowhere near assuring correctness of multithreaded code):

lock(int id)
{
    interested[id] = 1;
    turn = 1 - id;
    __asm__ __volatile__("mfence");

    do {
        __asm__ __volatile__("":::"memory");
    } while (turn == 1 - id
           && interested[1 - id] == 1);
}

unlock(int id)
{
   interested[id] = 0;
}

The MFENCE prevents stores and loads to different memory addresses from being reordered. Otherwise the write to interested[id] could be queued in the store buffer while the load of interested[1 - id] proceeds. On many current microarchitectures a SFENCE may be enough, since it may be implemented as a store buffer drain, but IIUC SFENCE doesn't need to be implemented that way, and may simply prevent reordering between stores. So SFENCE may not be enough everywhere, and we need a full MFENCE.

The compiler barrier (__asm__ __volatile__("":::"memory")) prevents the compiler from deciding that it already knows the value of turn. We're telling the compiler that we've clobbered memory, so all values cached in registers must be reloaded from memory.

P.S: I feel this needs a closing paragraph, but my brain is drained.

Categories

c - A tested implementation of Peterson Lock algorithm?

c - A tested implementation of Peterson Lock algorithm?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags