Timing method. I probably would have set it up so the test was selected by a command-line argument, so I could time it with `perf stat ./unaligned-test`, and get perf counter results instead of just wall-clock times for each test. That way, I wouldn't have to care about turbo / power-saving, since I could measure in core clock cycles. (Not the same thing as `gettimeofday` / `rdtsc` reference cycles unless you disable turbo and other frequency-variation.)
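A minimal sketch of selecting the test by name, so each `perf stat` run measures exactly one kernel. The kernel names, the placeholder bodies, and `select_test` are hypothetical stand-ins for illustration, not anything from the original benchmark:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical test kernels: stand-ins for the real benchmark loops. */
typedef long (*test_fn)(void);

static long test_aligned(void)   { return 0; }  /* placeholder body */
static long test_unaligned(void) { return 1; }  /* placeholder body */

static const struct { const char *name; test_fn fn; } tests[] = {
    { "aligned",   test_aligned   },
    { "unaligned", test_unaligned },
};

/* Return the kernel registered under `name`, or NULL if there is none. */
test_fn select_test(const char *name)
{
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++)
        if (strcmp(tests[i].name, name) == 0)
            return tests[i].fn;
    return NULL;
}
```

`main` would then call `select_test(argv[1])` and run the returned function, so something like `perf stat ./unaligned-test unaligned` reports counters for that one test only.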
You're only testing throughput, not latency, because none of the loads are dependent.
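To measure latency instead, each load's address has to come from the previous load's result. A rough sketch of that pointer-chasing idea in C follows. The helper names `make_self_pointer` and `chase` are made up for illustration, and whether the compiler keeps the loop as one load per iteration is an assumption you'd have to check in the disassembly:

```c
#include <stddef.h>
#include <string.h>

/* Store, at byte offset `off` inside `buf`, a pointer to that same location.
 * `off` can be any misalignment you want to test. */
void *make_self_pointer(unsigned char *buf, size_t off)
{
    void *p = buf + off;
    memcpy(p, &p, sizeof p);
    return p;
}

/* Follow the chain n times. Each load's address is the value produced by the
 * previous load, so the loads cannot overlap: this measures latency. */
void *chase(void *p, long n)
{
    while (n-- > 0) {
        void *next;
        memcpy(&next, p, sizeof next);  /* alignment-safe way to load a pointer */
        p = next;
    }
    return p;
}
```

Dividing the measured core clock cycles by `n` then gives the load-use latency for that offset, as long as the loop really is bottlenecked on the dependency chain.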
Your cache numbers will be worse than your memory numbers, but you might not realize why: the cached case can bottleneck on the number of split-load registers that handle loads/stores crossing a cache-line boundary. For sequential reads, the outer levels of cache still only ever see a sequence of requests for whole cache lines. It's only the execution units getting data from L1D that have to care about alignment. To test misalignment for the non-cached case, you could do scattered loads, so cache-line splits would need to bring two cache lines into L1.
Cache lines are 64 bytes wide (footnote 1), so you're always testing a mix of cache-line splits and within-a-cache-line accesses. Testing always-split loads would bottleneck harder on the split-load microarchitectural resources. (Actually, depending on your CPU, the cache-fetch width might be narrower than the line size. Recent Intel CPUs can fetch any unaligned chunk from inside a cache line, but that's because they have special hardware to make that fast. Other CPUs may only be at their fastest when fetching within a naturally-aligned 16 byte chunk or something. @BeeOnRope says that AMD CPUs may care about 16 byte and 32 byte boundaries.)
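To make the "mix" concrete, here is a small sketch that counts how many accesses in a sequential scan actually cross a boundary. The function names are hypothetical, and the 64-byte line size is the assumption from the paragraph above:

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if the `size` bytes starting at `addr` span a multiple of `boundary`. */
int crosses_boundary(uintptr_t addr, size_t size, uintptr_t boundary)
{
    return size > 0 && (addr / boundary) != ((addr + size - 1) / boundary);
}

/* Count boundary-crossing accesses among n accesses of `size` bytes,
 * starting at `start` and advancing by `stride` bytes each time. */
size_t count_splits(uintptr_t start, size_t stride, size_t size,
                    size_t n, uintptr_t boundary)
{
    size_t splits = 0;
    for (size_t i = 0; i < n; i++)
        splits += (size_t)crosses_boundary(start + i * stride, size, boundary);
    return splits;
}
```

For 16-byte loads at a +1 offset, `count_splits(1, 16, 16, 4, 64)` covers one line's worth of loads (offsets 1, 17, 33, 49) and only the last one is a split, so three quarters of the loads in such a test never touch the split-load machinery.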
You're not testing store → load forwarding at all. For existing tests, and a nice way to visualize results for different alignments, see this stuffedcow.net blog post: Store-to-Load Forwarding and Memory Disambiguation in x86 Processors.
Passing data through memory is an important use case, and misalignment + cache-line splits can interfere with store-forwarding on some CPUs. To properly test this, make sure you test different misalignments, not just 1:15 (vector) or 1:3 (integer). (You currently only test a +1 offset relative to 16B-alignment).
I forget if it's just for store-forwarding, or for regular loads, but there may be less penalty when a load is split evenly across a cache-line boundary (an 8:8 vector, and maybe also 4:4 or 2:2 integer splits). You should test this. (I might be thinking of P4 `lddqu` or Core 2 `movdqu`.)
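The a:b split notation used above can be computed directly, which is handy when sweeping every misalignment instead of just one. A sketch, with `split_sizes` as a hypothetical helper name and the line size passed in as a parameter:

```c
#include <stddef.h>
#include <stdint.h>

/* For an access of `size` bytes at `addr`, report how many bytes fall before
 * the next `line`-byte boundary (*first) and how many fall after it (*second).
 * *second == 0 means the access is not split. */
void split_sizes(uintptr_t addr, size_t size, uintptr_t line,
                 size_t *first, size_t *second)
{
    size_t room = (size_t)(line - addr % line);  /* bytes left in this line */
    *first  = size < room ? size : room;
    *second = size - *first;
}
```

With 64-byte lines, a 16-byte load at offset 63 is a 1:15 split, at offset 56 it is the even 8:8 split, and a 4-byte load at offset 63 is 1:3. Looping `addr` over 0..63 and bucketing timings by the resulting pair would show whether even splits are cheaper on a given CPU.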
Intel's optimization manual has big tables of misalignment vs. store-forwarding from a wide store to narrow reloads that are fully contained in it. On some CPUs, this works in more cases when the wide store was naturally-aligned, even if it doesn't cross any cache-line boundaries. (Maybe on SnB/IvB, since they use a banked L1 cache with 16B banks, and splits across those can affect store forwarding. I didn't re-check the manual, but if you really want to test this experimentally, that's something you should be looking for.)
Which reminds me, misaligned loads are more likely to provoke cache-bank conflicts on SnB/IvB (because one load can touch two banks). But you won't see this loading from a single stream, because accessing the same bank in the same line twice in one cycle is fine. It's only accessing the same bank in different lines that can't happen in the same cycle. (e.g., when two memory accesses are a multiple of 128 bytes apart.)
You don't make any attempt to test 4k page-splits. They are slower than regular cache-line splits, because they also need two TLB checks. (Skylake improved them from a ~100 cycles penalty to a ~5 cycles penalty beyond the normal load-use latency, though)
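A sketch of one way to get an address that straddles a 4k boundary for such a test. The helper name `page_split_ptr` is hypothetical, 4096-byte pages are assumed, and C11 `aligned_alloc` is used for the page-aligned buffer:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096u   /* assumption: 4 KiB pages, as on typical x86 systems */

/* Allocate two page-aligned pages and return a pointer `before` bytes below
 * the boundary between them. An access wider than `before` bytes at that
 * pointer is a page split. *base_out receives the pointer to free(). */
unsigned char *page_split_ptr(size_t before, void **base_out)
{
    unsigned char *base = aligned_alloc(PAGE_SIZE, 2 * PAGE_SIZE);
    if (!base)
        return NULL;
    memset(base, 0, 2 * PAGE_SIZE);   /* touch both pages so they are mapped */
    *base_out = base;
    return base + PAGE_SIZE - before;
}
```

For example, `page_split_ptr(4, &base)` gives an address where an 8-byte load is a 4:4 split across two pages. Timing loads there against loads at an ordinary cache-line split is one way to see the extra page-split cost.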
You fail to test `movups` on aligned addresses, so you wouldn't detect that `movups` is slower than `movaps` on Core 2 and earlier even when the memory is aligned at runtime. (I think unaligned `mov` loads up to 8 bytes were fine even in Core 2, as long as they didn't cross a cache-line boundary. IDK how old a CPU you'd have to look at to find a problem with non-vector loads within a cache line. It would be a 32-bit only CPU, but you could still test 8 byte loads with MMX or SSE, or even x87. P5 Pentium and later guarantee that aligned 8 byte loads/stores are atomic, but P6 and newer guarantee that cached 8 byte loads/stores are atomic as long as no cache-line boundary is crossed. Unlike AMD, where 8 byte boundaries matter for atomicity guarantees even in cacheable memory. See: Why is integer assignment on a naturally aligned variable atomic on x86?)
Go look at Agner Fog's stuff to learn more about how unaligned loads can be slower, and cook up tests to exercise those cases. Actually, Agner may not be the best resource for that, since his microarchitecture guide mostly focuses on getting uops through the pipeline. Just a brief mention of the cost of cache-line splits, nothing in-depth about throughput vs. latency.
See also: Cacheline splits, take two, from Dark Shikari's blog (x264 lead developer), talking about unaligned load strategies on Core2: it was worth it to check for alignment and use a different strategy for the block.
Footnotes:
1. 64B cache lines are a safe assumption these days. Pentium 3 and earlier had 32B lines. P4 had 64B lines but they were often transferred in 128B-aligned pairs. I thought I remembered reading that P4 actually had 128B lines in L2 or L3, but maybe that was just a distortion of 64B lines transferred in pairs. 7-CPU definitely says 64B lines in both levels of cache for a P4 130nm.
See also uarch-bench results for Skylake. Apparently someone has already written a tester that checks every possible misalignment relative to a cache-line boundary.
## My testing on Skylake desktop (i7-6700k):
Addressing mode affects load-use latency, exactly as Intel documents in their optimization manual. I tested with integer `mov rax, [rax+...]`, and with `movzx/sx` (in that case using the loaded value as an index, since it's too narrow to be a pointer).
```nasm
;;; Linux x86-64 NASM/YASM source.  Assemble into a static binary
;; public domain, originally written by [email protected].
;; Share and enjoy.  If it breaks, you get to keep both pieces.

;;; This kind of grew while I was testing and thinking of things to test
;;; I left in some of the comments, but took out most of them and summarized the results outside this code block
;;; When I thought of something new to test, I'd edit, save, and up-arrow my assemble-and-run shell command
;;; Then edit the result into a comment in the source.

section .bss

ALIGN   2 * 1<<20   ; 2MB = 4096*512.  Uses hugepages in .bss but not in .data.  I checked in /proc/<pid>/smaps
buf:    resb 16 * 1<<20

section .text
global _start
_start:
    mov     esi, 128

;   mov     edx, 64*123 + 8
;   mov     edx, 64*123 + 0
;   mov     edx, 64*64 + 0
    xor     edx,edx
    ;; RAX points into buf, 16B into the last 4k page of a 2M hugepage
    mov     eax, buf + (2<<20)*0 + 4096*511 + 64*0 + 16

    mov     ecx, 25000000           ; loop iterations

%define ADDR(x)  x                     ; SKL: 4c
;%define ADDR(x)  x + rdx              ; SKL: 5c
;%define ADDR(x)  128+60 + x + rdx*2   ; SKL: 11c cache-line split
;%define ADDR(x)  x-8                  ; SKL: 5c
;%define ADDR(x)  x-7                  ; SKL: 12c for 4k-split (even if it's in the middle of a hugepage)
; ... many more things and a block of other result-recording comments taken out

%define dst rax

    mov     [ADDR(rax)], dst        ; store the pointer so each load below reloads the same address
align 32
.loop:                              ; 4 dependent loads per iteration
    mov     dst, [ADDR(rax)]
    mov     dst, [ADDR(rax)]
    mov     dst, [ADDR(rax)]
    mov     dst, [ADDR(rax)]
    dec     ecx
    jnz     .loop

    xor     edi,edi                 ; exit status 0
    mov     eax,231                 ; __NR_exit_group
    syscall
```
Then run with
```
asm-link load-use-latency.asm && disas load-use-latency &&
    perf stat -etask-clock,cycles,L1-dcache-loads,instructions,branches -r4 ./load-use-latency

+ yasm -felf64 -Worphan-labels -gdwarf2 load-use-latency.asm
+ ld -o load-use-latency load-use-latency.o
 (disassembly output so my terminal history has the asm with the perf results)

 Performance counter stats for './load-use-latency' (4 runs):

         91.422838      task-clock:u (msec)       #    0.990 CPUs utilized            ( +-  0.09% )
       400,105,802      cycles:u                  #    4.376 GHz                      ( +-  0.00% )
       100,000,013      L1-dcache-loads:u         # 1093.819 M/sec                    ( +-  0.00% )
       150,000,039      instructions:u            #    0.37  insn per cycle           ( +-  0.00% )
        25,000,031      branches:u                #  273.455 M/sec                    ( +-  0.00% )

       0.092365514 seconds time elapsed                                          ( +-  0.52% )
```
In this case, I was testing `mov rax, [rax]`, naturally-aligned, so cycles = 4*L1-dcache-loads. 4c latency. I didn't disable turbo or anything like that. Since nothing is going off the core, core clock cycles is the best way to measure.
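The arithmetic behind that reading, as a tiny sketch (the function name `cycles_per_load` is just for illustration). The loop runs 25,000,000 iterations of 4 dependent loads, so about 100M loads total, and the `cycles` counter divided by the `L1-dcache-loads` counter gives the latency per load:

```c
/* Average core clock cycles per load, from perf-stat counter totals.
 * For a chain of dependent loads this is the load-use latency. */
double cycles_per_load(unsigned long long cycles, unsigned long long loads)
{
    return (double)cycles / (double)loads;
}
```

Plugging in the counts above, `cycles_per_load(400105802ULL, 100000013ULL)` comes out just over 4.0, with the small excess attributable to startup and exit overhead outside the loop.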