Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


x86 - Globally Invisible load instructions

Can some load instructions never be globally visible, due to store-to-load forwarding? To put it another way: if a load instruction gets its value from the store buffer, it never has to read from the cache.
Since it is generally stated that a load is globally visible when it reads from the L1d cache, the loads that do not read from L1d should be globally invisible.



1 Reply


The concept of global visibility for loads is tricky, because a load doesn't modify the global state of memory, and other threads can't directly observe it.

But once the dust settles after out-of-order / speculative execution, we can tell what value the load got if the thread stores it somewhere, or branches based on it. This observable behaviour of the thread is what's important. (Or we could observe it with a debugger, and/or just reason about what values a load could possibly see, if an experiment is difficult.)


At least on strongly-ordered CPUs like x86, all CPUs can agree on a total order of stores becoming globally visible, updating the single coherent+consistent cache+memory state. On x86, where StoreStore reordering isn't allowed, this TSO (Total Store Order) agrees with program-order of each thread. (I.e. the total order is some interleaving of program order from each thread). SPARC TSO is also this strongly ordered.

(Correctly observing the global order of your own stores relative to other stores requires mfence or similar: otherwise store-forwarding means you can see your own stores right away, before they become visible to other cores. x86 TSO is basically program-order plus store-forwarding.)

(For cache-bypassing stores, global visibility is when they're flushed from private write-combining buffers into DRAM. Intel Line Fill Buffers or any equivalent private write-combining mechanism where store data is still not visible to other CPUs is effectively part of the store buffer for our reordering purposes.)

On a weakly-ordered ISA, threads A and B might not agree on the order of stores X and Y done by threads C and D, even if the reading threads use acquire-loads to make sure their own loads aren't reordered. i.e. there might not be a global order of stores at all, let alone having it not be the same as program order.

The IBM POWER ISA is that weak, and so is the C++11 memory model (Will two atomic writes to different locations in different threads always be seen in the same order by other threads?). But the mechanism in practice on POWER is that (retired aka graduated) stores become visible to some other cores before they become globally visible by committing to L1d cache. Cache itself really is coherent even in POWER systems, like all normal CPUs, and allows sequential-consistency to be recovered with barriers. These multiple-order effects only happen due to SMT (multiple logical CPUs on one physical CPU) providing a way to see stores from other logical cores without going through cache.

(One possible mechanism would be letting other logical threads snoop non-speculative stores from the store buffer even before they commit to L1d, keeping only not-yet-retired stores private to a logical thread. This could reduce inter-thread latency slightly. x86 can't do this because it would break the strong memory model; Intel's HT statically partitions the store buffer when two threads are active on a core. But as @BeeOnRope comments, an abstract model of what reorderings are allowed is probably a better approach for reasoning about correctness. Just because you can't think of a HW mechanism to cause a reordering doesn't mean it can't happen.)

Weakly-ordered ISAs that aren't as weak as POWER (in practice and/or on paper) still do reordering in the local store buffer of each core, if barriers or release-stores aren't used, though. On many CPUs there is a global order for all stores, but it's not some interleaving of program order. OoO CPUs have to track memory order so a single thread doesn't need barriers to see its own stores in order, but allowing stores to commit from the store buffer to L1d out of program order could certainly improve throughput (especially if there are multiple stores pending for the same line, but program order would evict the line from a set-associative cache between each store. e.g. a nasty histogram access pattern.)


Let's do a thought experiment about where load data comes from

The above is still only about store visibility, not loads. Can we explain the value seen by every load as being read from global memory/cache at some point (disregarding any load-ordering rules)?

If so, then all the load results can be explained by putting all the stores and loads by all threads into some combined order, reading and writing a coherent global state of memory.

It turns out that no, we can't; the store buffer breaks this. Partial store-to-load forwarding gives us a counter-example (on x86, for example): a narrow store followed by a wide load can merge data from the store buffer with data from the L1d cache from before the store becomes globally visible. Real x86 CPUs actually do this, and we have real experiments to prove it.

If you only look at full store-forwarding, where the load only takes its data from one store in the store buffer, you could argue that the load is delayed by the store buffer. i.e. that the load appears in the global total load-store order right after the store that makes that value globally visible.

(This global total load-store order isn't an attempt to create an alternative memory-ordering model; it has no way to describe x86's actual load ordering rules.)


Partial store-forwarding exposes the fact that load data doesn't always come from the global coherent cache domain.

If a store from another core changes the surrounding bytes, an atomic wide load could read a value that never existed, and never will exist, in the global coherent state.

See my answer on Can x86 reorder a narrow store with a wider load that fully contains it?, and Alex's answer for experimental proof that such reordering can happen, making the proposed locking scheme in that question invalid. A store and then a reload from the same address isn't a StoreLoad memory barrier.

Some people (e.g. Linus Torvalds) describe this by saying the store buffer isn't coherent. (Linus was replying to someone else who had independently invented the same invalid locking idea.)

Another Q&A involving the store buffer and coherency: How to set bits of a bit vector efficiently in parallel?. You can do some non-atomic ORs to set bits, then come back and check for missed updates due to conflicts with other threads. But you need a StoreLoad barrier (e.g. an x86 lock or) to make sure you don't just see your own stores when you reload.


Proposed definition: A load becomes globally visible when it reads its data. Normally from L1d, but the store buffer or MMIO or uncacheable memory are other possible sources.

This definition agrees with x86 manuals which say that loads aren't reordered with other loads. i.e. they load (in program order) from the local core's view of memory.

The load itself can become globally visible independently of whether any other thread could ever load that value from that address.

Although perhaps it would make more sense not to talk about "global visibility" of cacheable loads at all, because they're pulling data from somewhere, not doing anything with a visible effect. Only uncacheable loads (e.g. from an MMIO region) should be considered visible side-effects.

(On x86, uncacheable stores and loads are very strongly ordered, so store-forwarding to an uncacheable load is, I think, impossible. Unless maybe the store was done via a WB mapping of the same physical page as the UC load is accessing.)

