
c++ - Using nested vectors vs a flatten vector wrapper, strange behaviour

The problem

For a long time I had the impression that using a nested std::vector<std::vector...> to simulate an N-dimensional array is in general bad, since the memory is not guaranteed to be contiguous and one may get cache misses. I thought it would be better to use a flat vector and map from multiple dimensions to 1D and vice versa. So I decided to test it (code listed at the end). It is pretty straightforward: I timed reading/writing to a nested 3D vector vs my own 3D wrapper of a 1D vector. I compiled the code with both g++ and clang++, with -O3 optimization turned on. For each run I changed the dimensions, so I could get a pretty good idea about the behaviour. To my surprise, these are the results I obtained on my machine, a MacBook Pro (Retina, 13-inch, Late 2012), 2.5GHz i5, 8GB RAM, OS X 10.10.5:

g++ 5.2

dimensions       nested   flat
X   Y   Z        (ms)     (ms) 

100 100 100  ->  16       24
150 150 150  ->  58       98
200 200 200  ->  136     308
250 250 250  ->  264     746
300 300 300  ->  440    1537

clang++ (LLVM 7.0.0)

dimensions       nested   flat
X   Y   Z        (ms)     (ms) 

100 100 100  ->  16       18
150 150 150  ->  53       61
200 200 200  ->  135     137
250 250 250  ->  255     271
300 300 300  ->  423     477


As you can see, the flattened wrapper never beats the nested version. Moreover, the g++/libstdc++ build performs quite badly compared to the clang++/libc++ build: for example, for 300 x 300 x 300 the flattened version is almost 4 times slower than the nested version, whereas with libc++ the two perform about equally.

My questions:

  1. Why isn't the flattened version faster? Shouldn't it be? Am I missing something in the testing code?
  2. Moreover, why does the g++/libstdc++ build perform so badly when using the flattened vector? Again, shouldn't it perform better?

The code I used:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <random>
#include <vector>

// Thin wrapper around flatten vector
template<typename T>
class Array3D
{
    std::size_t _X, _Y, _Z;
    std::vector<T> _vec;
public:
    Array3D(std::size_t X, std::size_t Y, std::size_t Z):
        _X(X), _Y(Y), _Z(Z), _vec(_X * _Y * _Z) {}
    T& operator()(std::size_t x, std::size_t y, std::size_t z)
    {
        return _vec[z * (_X * _Y) + y * _X + x];
    }
    const T& operator()(std::size_t x, std::size_t y, std::size_t z) const
    {
        return _vec[z * (_X * _Y) + y * _X + x];
    }
};

int main(int argc, char** argv)
{
    std::random_device rd{};
    std::mt19937 rng{rd()};
    std::uniform_real_distribution<double> urd(-1, 1);

    const std::size_t X = std::stol(argv[1]);
    const std::size_t Y = std::stol(argv[2]);
    const std::size_t Z = std::stol(argv[3]);


    // Standard library nested vector
    std::vector<std::vector<std::vector<double>>>
        vec3D(X, std::vector<std::vector<double>>(Y, std::vector<double>(Z)));

    // 3D wrapper around a 1D flat vector
    Array3D<double> vec1D(X, Y, Z);

    // TIMING nested vectors
    std::cout << "Timing nested vectors...
";
    auto start = std::chrono::steady_clock::now();
    volatile double tmp1 = 0;
    for (std::size_t x = 0 ; x < X; ++x)
    {
        for (std::size_t y = 0 ; y < Y; ++y)
        {
            for (std::size_t z = 0 ; z < Z; ++z)
            {
                vec3D[x][y][z] = urd(rng);
                tmp1 += vec3D[x][y][z];
            }
        }
    }
    std::cout << "Sum: " << tmp1 << std::endl; // we make sure the loops are not optimized out
    auto end = std::chrono::steady_clock::now();
    std::cout << "Took: ";
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << ms << " milliseconds\n";

    // TIMING flatten vector
    std::cout << "Timing flatten vector...
";
    start = std::chrono::steady_clock::now();
    volatile double tmp2 = 0;
    for (std::size_t x = 0 ; x < X; ++x)
    {
        for (std::size_t y = 0 ; y < Y; ++y)
        {
            for (std::size_t z = 0 ; z < Z; ++z)
            {
                vec1D(x, y, z) = urd(rng);
                tmp2 += vec1D(x, y, z);
            }
        }
    }
    std::cout << "Sum: " << tmp2 << std::endl; // we make sure the loops are not optimized out
    end = std::chrono::steady_clock::now();
    std::cout << "Took: ";
    ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << ms << " milliseconds\n";
}

EDIT

Changing the Array3D<T>::operator() return to

return _vec[(x * _Y + y) * _Z + z];

as per @1201ProgramAlarm's suggestion does indeed get rid of the "weird" behaviour of g++: the flat and nested versions now take roughly the same time. However, it's still intriguing. I thought the nested one would be much worse due to cache issues. Could I just be lucky and have all the memory contiguously allocated?
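For context (this sketch is mine, not part of the question): with the loop nest used in the benchmark (x outermost, z innermost), the original formula makes each ++z jump X*Y elements through memory, while the fixed formula makes z the contiguous dimension, so the innermost loop walks memory with stride 1:

#include <cstddef>
#include <iostream>

int main()
{
    const std::size_t X = 300, Y = 300, Z = 300;
    // original mapping: z is the largest-stride index
    auto idx_orig  = [&](std::size_t x, std::size_t y, std::size_t z) { return z * (X * Y) + y * X + x; };
    // fixed mapping: z is the stride-1 index
    auto idx_fixed = [&](std::size_t x, std::size_t y, std::size_t z) { return (x * Y + y) * Z + z; };

    // distance (in elements) between consecutive z values at a fixed (x, y)
    std::cout << "original stride: " << idx_orig(0, 0, 1)  - idx_orig(0, 0, 0)  << '\n'   // 90000 elements = 720000 bytes
              << "fixed stride:    " << idx_fixed(0, 0, 1) - idx_fixed(0, 0, 0) << '\n';  // 1 element = 8 bytes
}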

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply


Why the nested vectors are about the same speed as flat in your microbenchmark, after fixing the indexing order: You'd expect the flat array to be faster (see Tobias's answer about potential locality problems, and my other answer for why nested vectors suck in general, but not too badly for sequential access). But your specific test is doing so many things that let out-of-order execution hide the overhead of using nested vectors, and/or that just slow things down so much that the extra overhead is lost in measurement noise.

I put your performance-bugfixed source code up on Godbolt so we can look at the asm of the inner loop as compiled by g++5.2, with -O3. (Apple's fork of clang might be similar to clang3.7, but I'll just look at the gcc version.) There's a lot of code from C++ functions, but you can right-click on a source line to scroll the asm windows to the code for that line. Also, mouseover a source line to bold the asm that implements that line, or vice versa.

gcc's inner two loops for the nested version are as follows (with some comments added by hand):

## outer-most loop not shown

.L213:  ## middle loop (over `y`)
    test    rbp, rbp        # Z
    je      .L127           # inner loop runs zero times if Z==0
    mov     rax, QWORD PTR [rsp+80]   # MEM[(struct vector * *)&vec3D], MEM[(struct vector * *)&vec3D]
    xor     r15d, r15d        # z = 0
    mov     rax, QWORD PTR [rax+r12]  # MEM[(struct vector * *)_195], MEM[(struct vector * *)_195]
    mov     rdx, QWORD PTR [rax+rbx]  # D.103857, MEM[(double * *)_38]

## Top of inner-most loop.
.L128:
    lea     rdi, [rsp+5328]   # tmp511,   ## function arg: pointer to the RNG object, which is a local on the stack.
    lea     r14, [rdx+r15*8]  # D.103851,  ## r14 = &(vec3D[x][y][z])
    call    double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&)  #
    addsd   xmm0, xmm0    # D.103853, D.103853  ## return val *= 2.0: [0.0, 2.0]
    mov     rdx, QWORD PTR [rsp+80]   # MEM[(struct vector * *)&vec3D], MEM[(struct vector * *)&vec3D]   ## redo the pointer-chasing from vec3D.data()
    mov     rdx, QWORD PTR [rdx+r12]  # MEM[(struct vector * *)_150], MEM[(struct vector * *)_150]
    subsd   xmm0, QWORD PTR .LC6[rip]     # D.103859, ## and subtract 1.0:  [-1.0, 1.0]
    mov     rdx, QWORD PTR [rdx+rbx]  # D.103857, MEM[(double * *)_27]
    movsd   QWORD PTR [r14], xmm0 # *_155, D.103859        # store into vec3D[x][y][z]
    movsd   xmm0, QWORD PTR [rsp+64]      # D.103853, tmp1  # reload volatile tmp1
    addsd   xmm0, QWORD PTR [rdx+r15*8]   # D.103853, *_62  # add the value just stored into the array (r14 = rdx+r15*8 because nothing else modifies the pointers in the outer vectors)
    add     r15, 1    # z,
    cmp     rbp, r15  # Z, z
    movsd   QWORD PTR [rsp+64], xmm0      # tmp1, D.103853  # spill tmp1
    jne     .L128     #,
 #End of inner-most loop

.L127:  ## middle-loop
    add     r13, 1    # y,
    add     rbx, 24           # sizeof(std::vector<> == 24) == the size of 3 pointers.
    cmp     QWORD PTR [rsp+8], r13    # %sfp, y
    jne     .L213     #,

 ## outer loop not shown.

And for the flat loop:

 ## outer not shown.
.L214:
    test    rbp, rbp        # Z
    je      .L135       #,
    mov     rax, QWORD PTR [rsp+280]  # D.103849, vec1D._Y
    mov     rdi, QWORD PTR [rsp+288]  # D.103849, vec1D._Z
    xor     r15d, r15d        # z
    mov     rsi, QWORD PTR [rsp+296]  # D.103857, MEM[(double * *)&vec1D + 24B]

.L136:  ## inner-most loop
    imul    rax, r12        # D.103849, x
    lea     rax, [rax+rbx]    # D.103849,
    imul    rax, rdi        # D.103849, D.103849
    lea     rdi, [rsp+5328]   # tmp520,
    add     rax, r15  # D.103849, z
    lea     r14, [rsi+rax*8]  # D.103851,       # &vec1D(x,y,z)
    call    double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&)  #
    mov     rax, QWORD PTR [rsp+280]  # D.103849, vec1D._Y
    addsd   xmm0, xmm0    # D.103853, D.103853
    mov     rdi, QWORD PTR [rsp+288]  # D.103849, vec1D._Z
    mov     rsi, QWORD PTR [rsp+296]  # D.103857, MEM[(double * *)&vec1D + 24B]
    mov     rdx, rax  # D.103849, D.103849
    imul    rdx, r12        # D.103849, x       # redo address calculation a 2nd time per iteration
    subsd   xmm0, QWORD PTR .LC6[rip]     # D.103859,
    add     rdx, rbx  # D.103849, y
    imul    rdx, rdi        # D.103849, D.103849
    movsd   QWORD PTR [r14], xmm0 # MEM[(double &)_181], D.103859  # store into the address calculated earlier
    movsd   xmm0, QWORD PTR [rsp+72]      # D.103853, tmp2
    add     rdx, r15  # tmp374, z
    add     r15, 1    # z,
    addsd   xmm0, QWORD PTR [rsi+rdx*8]   # D.103853, MEM[(double &)_170]   # tmp2 += vec1D(x,y,z).  rsi+rdx*8 == r14, so this is a reload of the store this iteration.
    cmp     rbp, r15  # Z, z
    movsd   QWORD PTR [rsp+72], xmm0      # tmp2, D.103853
    jne     .L136     #,

.L135:  ## middle loop: increment y
    add     rbx, 1    # y,
    cmp     r13, rbx  # Y, y
    jne     .L214     #,

 ## outer loop not shown.

Your MacBook Pro (Late 2012) has an Intel IvyBridge CPU, so I'm using numbers for that microarchitecture from Agner Fog's instruction tables and microarch guide. Things should be mostly the same on other Intel/AMD CPUs.

The only 2.5GHz mobile IvB i5 is the i5-3210M, so your CPU has 3MiB of L3 cache. This means even your smallest test case (100^3 * 8B per double ~= 7.63MiB) is larger than your last-level cache, so none of your test cases fit in cache at all. That's probably a good thing, because you allocate and default-initialize both nested and flat before testing either of them. However, you do test in the same order you allocate, so if the nested array is still in cache after zeroing the flat array, the flat array may still be hot in L3 cache after the timing loop over the nested array.
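As a quick cross-check of those sizes (a sketch of mine, assuming 8-byte doubles and the 3MiB L3 figure above):

#include <cstddef>
#include <iostream>

int main()
{
    const std::size_t l3_bytes = 3ull * 1024 * 1024;        // i5-3210M last-level cache
    for (std::size_t n : {100, 150, 200, 250, 300})
    {
        std::size_t bytes = n * n * n * sizeof(double);     // one n^3 array of doubles
        std::cout << n << "^3 doubles = " << bytes / (1024.0 * 1024.0) << " MiB ("
                  << (bytes > l3_bytes ? "exceeds" : "fits in") << " L3)\n";
    }
}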

If you'd used a repeat-loop to loop over the same array multiple times, you could have got times large enough to measure for smaller array sizes.
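A minimal sketch of that idea (REPEATS is a made-up knob, not something from the original code); a read-only pass repeated several times gives even the 100^3 case a measurable duration:

const int REPEATS = 10;                       // hypothetical repeat count
auto start = std::chrono::steady_clock::now();
volatile double tmp = 0;
for (int r = 0; r < REPEATS; ++r)
    for (std::size_t x = 0; x < X; ++x)
        for (std::size_t y = 0; y < Y; ++y)
            for (std::size_t z = 0; z < Z; ++z)
                tmp += vec1D(x, y, z);        // fill vec1D once before this loop
auto end = std::chrono::steady_clock::now();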


You're doing several things here that are super-weird and make this so slow that out-of-order execution can hide the extra latency of changing y, even if your inner z vectors are not perfectly contiguous.

  1. You run a slow PRNG inside the timed loop. std::uniform_real_distribution<double> urd(-1, 1); is extra overhead on top of std::mt19937 rng{rd()};, which is already slow compared to FP-add latency (3 cycles), or compared to the L1D cache load throughput of 2 per cycle. All this extra time running the PRNG gives out-of-order execution a chance to run the array-indexing instructions so the final address is ready by the time the data is. Unless you have a lot of cache misses, you're mostly just measuring PRNG speed, because it produces results much slower than 1 per clock cycle.

    g++5.2 doesn't fully inline the urd(rng) code, and the x86-64 System V calling convention has no call-preserved XMM registers. So tmp1/tmp2 have to be spilled/reloaded for every element, even if they weren't volatile.

    It also loses its place in the Z vector, and has to redo the outer 2 levels of indirection before accessing the next z element (see the repeated pointer-chasing loads from vec3D in the asm above), because it has to assume the non-inlined call could have modified the outer vectors' data pointers.
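    One way to take the PRNG cost out of the measurement (my sketch, not part of the original answer) is to pre-generate the random values before starting the clock, so the timed loops only touch the arrays:

    // pre-generate outside the timed region; the PRNG cost is paid here
    std::vector<double> values(X * Y * Z);
    for (double& v : values)
        v = urd(rng);

    auto start = std::chrono::steady_clock::now();
    volatile double tmp = 0;
    std::size_t i = 0;
    for (std::size_t x = 0; x < X; ++x)
        for (std::size_t y = 0; y < Y; ++y)
            for (std::size_t z = 0; z < Z; ++z)
            {
                vec1D(x, y, z) = values[i++];   // plain store
                tmp += vec1D(x, y, z);          // and an immediate reload
            }
    auto end = std::chrono::steady_clock::now();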


...