Yes, the write combining and coalescing properties of the LFBs support all memory types except the UC type. You can observe their impact experimentally using the following program, which takes two parameters as input:

- `STORE_COUNT`: the number of 8-byte stores to perform sequentially.
- `INCREMENT`: the stride in bytes between consecutive stores.

There are four different values of `INCREMENT` that are particularly interesting:
- 64: All stores are performed on unique cache lines. Write combining and coalescing will not take effect.
- 0: All stores are to the same cache line and the same location within that line. Write coalescing takes effect in this case.
- 8: Every 8 consecutive stores are to the same cache line, but different locations within that line. Write combining takes effect in this case.
- 4: The target locations of consecutive stores overlap within the same cache line. Some stores might cross a cache line boundary (depending on `STORE_COUNT`). Both write combining and coalescing will take effect.
There is another parameter, `ITERATIONS`, which is used to repeat the same experiment many times to make reliable measurements. You can keep it at 1000.
```nasm
%define ITERATIONS 1000

BITS 64
DEFAULT REL

section .bss
align 64
bufsrc: resb STORE_COUNT*64

section .text
global _start
_start:
    mov ecx, ITERATIONS

.loop:
; Flush all the cache lines to make sure that it takes a substantial amount of time to fetch them.
    lea rsi, [bufsrc]
    mov edx, STORE_COUNT

.flush:
    clflush [rsi]
    sfence
    lfence
    add rsi, 64
    sub edx, 1
    jnz .flush

; This is the main loop where the stores are issued sequentially.
    lea rsi, [bufsrc]
    mov edx, STORE_COUNT

.inner:
    mov [rsi], rdx
    sfence                      ; Prevents potential combining in the store buffer.
    add rsi, INCREMENT
    sub edx, 1
    jnz .inner

; Spend some time doing nothing so that all the LFBs become free for the next iteration.
    mov edx, 100000

.wait:
    lfence
    sub edx, 1
    jnz .wait

    sub ecx, 1
    jnz .loop

; Exit.
    xor edi, edi
    mov eax, 231
    syscall
```
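Note that `STORE_COUNT` and `INCREMENT` are not defined in the source; they are meant to be supplied at assembly time. A minimal build-and-run sketch, assuming the listing is saved as `wc.asm` (a hypothetical filename) and that NASM and `ld` are available:

```sh
# STORE_COUNT and INCREMENT are defined on the command line rather than in the source.
nasm -f elf64 -D STORE_COUNT=32 -D INCREMENT=64 wc.asm -o wc.o
ld wc.o -o wc
./wc
```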
I recommend the following setup (a command sketch follows the list):

- Disable all hardware prefetchers using `sudo wrmsr -a 0x1A4 0xf`. This ensures that they will not interfere (or have minimal interference) with the experiments.
- Set the CPU frequency to the maximum. This increases the probability that the main loop will be fully executed before the first cache line reaches the L1 and causes an LFB to be freed.
- Disable hyperthreading because the LFBs are shared (at least since Sandy Bridge, but not on all microarchitectures).
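A hedged sketch of the first two steps on Linux, assuming the `msr-tools` and `cpupower` utilities are installed (MSR `0x1A4` holds the prefetcher-disable bits on many Intel client cores; the exact bit meanings vary by microarchitecture):

```sh
# Load the msr kernel module so wrmsr can access model-specific registers.
sudo modprobe msr
# Set the four prefetcher-disable bits on all cores.
sudo wrmsr -a 0x1A4 0xf
# Keep the clock pinned high by selecting the performance governor.
sudo cpupower frequency-set -g performance
```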
The `L1D_PEND_MISS.FB_FULL` performance counter lets us capture how write combining affects the availability of LFBs. It is supported on Intel Core and later. It is described as follows:
> Number of times a request needed a FB (Fill Buffer) entry but there was no entry available for it. A request includes cacheable/uncacheable demands that are load, store or SW prefetch instructions.
First run the code without the inner loop and make sure that `L1D_PEND_MISS.FB_FULL` is zero, which means that the flush loop has no impact on the event count.
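One way to collect the event on Linux is with `perf stat`; the event name below is the one `perf list` typically reports for it, but it can differ across kernel and CPU versions, so verify it first:

```sh
# Count fill-buffer-full stalls over the whole run of the program.
perf stat -e l1d_pend_miss.fb_full ./wc
```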
The following figure plots `STORE_COUNT` against the total `L1D_PEND_MISS.FB_FULL` count divided by `ITERATIONS`.
We can observe the following:
- It's clear that there are exactly 10 LFBs.
- When write combining or coalescing is possible, `L1D_PEND_MISS.FB_FULL` is zero for any number of stores.
- When the stride is 64 bytes, `L1D_PEND_MISS.FB_FULL` is larger than zero when the number of stores is larger than 10.
Later you quote that "[WC is] particularly important for writes to uncached memory", which seemingly contradicts the statement that write combining doesn't apply to the UC type. However, both WC and UC are classified as uncachable, so you can put the two statements together to deduce that write combining is particularly important for writes to WC memory.
See also: Where is the Write-Combining Buffer located? x86.