c++ - Why does false sharing still affect non atomics, but much less than atomics?

posted Oct 24, 2021 in Technique by 深蓝

Consider the following example, which demonstrates that false sharing exists:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

using type = std::atomic<std::int64_t>;

struct alignas(128) shared_t
{
  type  a;
  type  b;
} sh;

struct not_shared_t
{
  alignas(128) type a;
  alignas(128) type b;
} not_sh;

One thread increments a by steps of 1, another thread increments b. Increments compile to lock xadd with MSVC, even though the result is unused.

For the structure where a and b are separated (not_shared_t), the value accumulated in a few seconds is several times greater than for shared_t (about 7x in the table below).

So far, this is the expected result: separate cache lines stay hot in L1d cache and the increment bottlenecks on lock xadd throughput, while false sharing is a performance disaster because the cache line ping-pongs between cores. (Editor's note: later MSVC versions use lock inc when optimization is enabled. This may widen the gap between the contended and uncontended cases.)

Now I'm replacing using type = std::atomic<std::int64_t>; with using type = std::int64_t; (a plain, non-atomic integer).

(The non-atomic increment compiles to inc QWORD PTR [rcx]. The atomic load of stop in the loop condition happens to stop the compiler from just keeping the counter in a register until loop exit.)

The count reached for not_shared_t is still greater than for shared_t, but now by less than a factor of two.

|          type is          | variables are |      a=     |      b=     |
|---------------------------|---------------|-------------|-------------|
| std::atomic<std::int64_t> |    shared     |   59’052’951|   59’052’951|
| std::atomic<std::int64_t> |  not_shared   |  417’814’523|  416’544’755|
|       std::int64_t        |    shared     |  949’827’195|  917’110’420|
|       std::int64_t        |  not_shared   |1’440’054’733|1’439’309’339|

Why is the non-atomic case so much closer in performance?

Here is the rest of the program, to complete the minimal reproducible example. (It is also on Godbolt with MSVC, ready to compile and run.)

std::atomic<bool> start, stop;

void thd(type* var)
{
  while (!start) ;
  while (!stop) (*var)++;
}

int main()
{
  std::thread threads[] = {
     std::thread( thd, &sh.a ),     std::thread( thd, &sh.b ),
     std::thread( thd, &not_sh.a ), std::thread( thd, &not_sh.b ),
  };

  start.store(true);

  std::this_thread::sleep_for(std::chrono::seconds(2));

  stop.store(true);
  for (auto& thd : threads) thd.join();

  std::cout
    << " shared: "    << sh.a     << ' ' << sh.b     << '\n'
    << "not shared: " << not_sh.a << ' ' << not_sh.b << '\n';
}

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:33:52+0000

A non-atomic memory increment can benefit from store forwarding when it reloads its own stored value. This can happen even while the cache line is invalid. The core knows that the store will happen eventually, and the memory-ordering rules allow this core to see its own stores before they become globally visible.

Store forwarding lets the core perform as many increments as there are store-buffer entries before it stalls, instead of needing exclusive access to the cache line for every atomic RMW increment.

When this core does eventually gain ownership of the cache line, it can commit multiple stores at 1 per clock. That is 6x faster than the dependency chain created by a memory-destination increment: ~5 cycles of store/reload latency plus 1 cycle of ALU latency. So in the non-atomic case, execution only puts new stores into the store buffer (SB) at 1/6th the rate at which it can drain while the core owns the line. This is why there isn't a huge gap between shared and non-shared in the non-atomic case.

There will certainly be some memory-ordering machine clears, too; those and/or a full store buffer are the likely reasons for the lower throughput in the false-sharing case. See the answers and comments on "What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?" for another experiment somewhat like this one.

A lock inc or lock xadd forces the store buffer to drain before the operation, and committing to L1d cache is part of the operation itself. This makes store forwarding impossible, and the operation can only execute while the cache line is owned in the Exclusive or Modified MESI state.

Related:
Size of store buffers on Intel hardware? What exactly is a store buffer?
Can modern x86 implementations store-forward from more than one prior store? (No, but the details there may help you understand exactly what store buffers do and how store forwarding works in this case, where the reload exactly overlaps the store.)
