
x86-64: Why am I seeing more RFO (Read For Ownership) requests using REP MOVSB than with vmovdqa?

Check out Edit3

I was getting the wrong results because I was measuring without including prefetch-triggered events, as discussed here. That being said, AFAIK the only reduction in RFO requests I see with rep movsb compared to a Temporal Store memcpy is due to better prefetching on loads and no prefetching on stores, NOT due to RFO requests being optimized out for full cache line stores. This kind of makes sense, as we don't see RFO requests optimized out for vmovdqa with a zmm register, which we would expect if that were really the case for full cache line stores. That being said, the lack of prefetching on stores and the lack of non-temporal writes makes it hard to see how rep movsb achieves reasonable performance.

Edit: It is possible that the RFO requests from rep movsb differ from those for vmovdqa, in that for rep movsb it might not request the data, just take the line in Exclusive state. This could also be the case for stores with a zmm register. I don't see any perf metrics to test this, however. Does anyone know of any?
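One way to enumerate candidate counters (assuming perf is installed) is to grep the event list for anything RFO-related:

perf list | grep -i rfo

though I have not spotted an event there that distinguishes a data-less Exclusive-state request from a normal RFO.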

Questions

  1. Why am I not seeing a reduction in RFO requests when I use rep movsb for memcpy as compared to a memcpy implemented with vmovdqa?
  2. Why am I seeing more RFO requests when I use rep movsb for memcpy as compared to a memcpy implemented with vmovdqa?

Two separate questions, because I believe I should be seeing a reduction in RFO requests with rep movsb; but even if that is not the case, should I really be seeing an increase?

Background

CPU - Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz

I was trying to test out the number of RFO requests when using different methods of memcpy including:

  • Temporal Stores -> vmovdqa
  • Non-Temporal Stores -> vmovntdq
  • Enhanced REP MOVSB -> rep movsb

And I have been unable to see a reduction in RFO requests using rep movsb. In fact, I have been seeing more RFO requests with rep movsb than with Temporal Stores. This is counter-intuitive, given that the consensus understanding seems to be that for Ivy Bridge and newer, rep movsb is able to avoid RFO requests and in turn save memory bandwidth:

When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:

  • Avoiding the RFO request when it knows the entire cache line will be overwritten.

Note that on Ivybridge and Haswell, with buffers too large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs an RFO into LLC, rep movsb does not

I wrote a simple test program to verify this but was unable to do so.

Test Program

#include <assert.h>
#include <errno.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))


#define TEMPORAL          0
#define NON_TEMPORAL      1
#define REP_MOVSB         2
#define NONE_OF_THE_ABOVE 3

#define TODO 1 /* set to TEMPORAL, NON_TEMPORAL, REP_MOVSB, or NONE_OF_THE_ABOVE and rebuild for each run */

#if TODO == NON_TEMPORAL
#define store(x, y) _mm256_stream_si256((__m256i *)(x), y)
#else
#define store(x, y) _mm256_store_si256((__m256i *)(x), y)
#endif

#define load(x)     _mm256_load_si256((__m256i *)(x))

void *
mmapw(uint64_t sz) {
    void * p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    assert(p != MAP_FAILED); /* mmap signals failure with MAP_FAILED, not NULL */
    return p;
}
void BENCH_ATTR
bench() {
    uint64_t len = 64UL * (1UL << 22);

    uint64_t len_alloc = len;
    char *   dst_alloc = (char *)mmapw(len);
    char *   src_alloc = (char *)mmapw(len);

    for (uint64_t i = 0; i < len; i += 4096) {
        // page in before testing. perf metrics appear to still come through
        dst_alloc[i] = 0;
        src_alloc[i] = 0;
    }

    uint64_t dst     = (uint64_t)dst_alloc;
    uint64_t src     = (uint64_t)src_alloc;
    uint64_t dst_end = dst + len;



    asm volatile("lfence" : : : "memory");
#if TODO == REP_MOVSB
    // test rep movsb
    asm volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(len) : : "memory");
#elif TODO == TEMPORAL || TODO == NON_TEMPORAL
    // test vmovntdq or vmovdqa
    for (; dst < dst_end;) {
        __m256i lo = load(src);
        __m256i hi = load(src + 32);
        store(dst, lo);
        store(dst + 32, hi);
        dst += 64;
        src += 64;
    }
#endif

    asm volatile("lfence
mfence" : : : "memory");

    assert(!munmap(dst_alloc, len_alloc));
    assert(!munmap(src_alloc, len_alloc));
}

int
main(int argc, char ** argv) {
    bench();
}

  • Build (assuming file name is rfo_test.c):
gcc -O3 -march=native -mtune=native rfo_test.c -o rfo_test
  • Run (assuming executable is rfo_test):
perf stat -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test
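  • Optionally, to reduce scheduling noise, the same run can be pinned to a single core (taskset is part of util-linux; this is an addition, not part of the original recipe):
taskset -c 0 perf stat -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test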

Test Data

Note: see Edit2 below for the same data with less noise.

  • TODO = TEMPORAL
       583,912,867      cpu-cycles
         9,352,817      l2_rqsts.all_rfo
       188,343,479      offcore_requests_outstanding.cycles_with_demand_rfo
        11,560,370      offcore_requests.demand_rfo

       0.166557783 seconds time elapsed

       0.044670000 seconds user
       0.121828000 seconds sys
  • TODO = NON_TEMPORAL
       560,933,296      cpu-cycles
         7,428,210      l2_rqsts.all_rfo
       123,174,665      offcore_requests_outstanding.cycles_with_demand_rfo
         8,402,627      offcore_requests.demand_rfo

       0.156790873 seconds time elapsed

       0.032157000 seconds user
       0.124608000 seconds sys
  • TODO = REP_MOVSB
       566,898,220      cpu-cycles
        11,626,162      l2_rqsts.all_rfo
       178,043,659      offcore_requests_outstanding.cycles_with_demand_rfo
        12,611,324      offcore_requests.demand_rfo

       0.163038739 seconds time elapsed

       0.040749000 seconds user
       0.122248000 seconds sys
  • TODO = NONE_OF_THE_ABOVE
       521,061,304      cpu-cycles
         7,527,122      l2_rqsts.all_rfo
       123,132,321      offcore_requests_outstanding.cycles_with_demand_rfo
         8,426,613      offcore_requests.demand_rfo

       0.139873929 seconds time elapsed

       0.007991000 seconds user
       0.131854000 seconds sys

Test Results

The baseline, with just the setup but no memcpy, is TODO = NONE_OF_THE_ABOVE at 7,527,122 RFO requests.

With TODO = TEMPORAL (using vmovdqa) we see 9,352,817 RFO requests. This is lower than with TODO = REP_MOVSB (using rep movsb), which has 11,626,162 RFO requests: roughly 2.3 million more RFO requests with rep movsb than with Temporal Stores. The only case where I was able to see RFO requests avoided was TODO = NON_TEMPORAL (using vmovntdq), with 7,428,210 RFO requests, about the same as the baseline, indicating essentially none from the memcpy itself.
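For scale: len is 64 * 2^22 = 268,435,456 bytes (256 MiB), i.e. 268,435,456 / 64 = 4,194,304 destination cache lines. If every temporal store line cost one RFO, we would expect roughly 4.2 million RFO requests above baseline; the measured delta for TEMPORAL is only 9,352,817 - 7,527,122 ≈ 1.8 million, presumably because prefetch-generated RFOs are not fully captured by these demand-oriented counters (see the prefetch note in the edit at the top).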

I played around with different sizes for the memcpy, thinking I might need to decrease or increase the size for rep movsb to trigger that optimization, but I have been seeing the same general results. For all sizes I tested, the number of RFO requests ordered as NON_TEMPORAL < TEMPORAL < REP_MOVSB.
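For reference, a minimal way to sweep sizes without recompiling is to pass the size exponent on the command line; this is a hypothetical tweak, not part of the original program:

void BENCH_ATTR
bench(uint64_t shift) {
    uint64_t len = 64UL * (1UL << shift);
    /* ... rest of bench() unchanged ... */
}

int
main(int argc, char ** argv) {
    // Hypothetical: "./rfo_test 24" copies 64 * 2^24 bytes; the default
    // matches the original 64 * 2^22.
    bench(argc > 1 ? strtoull(argv[1], NULL, 10) : 22);
}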

Theories

  • [Unlikely] Something new on Icelake?

Edit: @PeterCordes was able to reproduce the results on Skylake

I don't think this is an Icelake-specific thing, as the only change I could find in the Intel manual regarding rep movsb on Icelake is:

Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long. Support for fast-short REP MOVSB is enumerated by the CPUID feature flag: CPUID.(EAX=7H, ECX=0H):EDX.FAST_SHORT_REP_MOVSB[bit 4] = 1. There is no change in the REP STOS performance.

Which should not be playing a factor in the test program I am using, given that len is well above 128 bytes.
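As a sanity check (on Linux, which exports these CPUID bits as cpuinfo flags), ERMS and FSRM support can be confirmed with:

grep -o 'erms\|fsrm' /proc/cpuinfo | sort -u

which should list both erms (Enhanced REP MOVSB) and fsrm (Fast Short REP MOVSB) on an Icelake client part.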

  • [Likelier] My test program is broken

I don't see any issues, but this is a very surprising result. At the very least, I verified that the compiler is not optimizing out the tests here.
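One quick way to double-check (assuming binutils is available) is to disassemble the binary and confirm the expected instructions survive in bench:

objdump -d rfo_test | grep -E 'movsb|vmovdqa|vmovntdq'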

Edit: Fixed the build instructions to use g++ instead of gcc and the file suffix from .c to .cc

Edit2:

Back to C and GCC.

  • Better Perf Recipe:
perf stat --all-user -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test
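(--all-user restricts every listed event to user-space counting, which keeps the kernel's page-fault and page-zeroing work from mmap/munmap out of the counts; note that sys time dominates user time in all of the runs above.)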

Numbers with better perf recipe (same trend but less noise):

  • TODO = TEMPORAL
       161,214,341      cpu-cycles                                                  
         1,984,998      l2_rqsts.all_rfo                                            
        61,238,129      offcore_requests_outstanding.cycles_with_demand_rfo                                   
         3,161,504      offcore_requests.demand_rfo                                   

       0.169413413 seconds time elapsed

       0.044371000 seconds user
       0.125045000 seconds sys
  • TODO = NON_TEMPORAL
       142,689,742      cpu-cycles                                                  
             3,106      l2_rqsts.all_rfo                                            
             4,581      offcore_requests_outstanding.cycles_with_demand_rfo                                   
                30      offcore_requests.demand_rfo                                   

       0.166300952 seconds time elapsed
