The problem is that the compiler is doing too many optimizations :)
First of all, I disabled the inlining of make_x(); otherwise we cannot distinguish between RVO and inlining. However, I did put the rest into an anonymous namespace so that external linkage does not interfere with any other compiler optimizations. (As evidence shows, external linkage can prevent inlining, for example, and who knows what else...) I also rewrote the input-output so that it uses printf(); otherwise the generated assembly would be cluttered with all the iostream stuff. So the code:
#include <cstdio>
using namespace std;

namespace {

struct x {
    //int dummy[1024];
    x() { printf("original x address %p\n", this); }
};

__attribute__((noinline)) x make_x() {
    return x();
}

} // namespace

int main() {
    auto x1 = make_x();
    printf("copy of x address %p\n", &x1);
}
I analyzed the generated assembly with a colleague of mine, as my understanding of gcc-generated assembly is very limited. Later today, I used clang with the -S -emit-llvm flags to generate LLVM assembly, which I personally find much nicer and easier to read than x86 assembly in GAS syntax. It didn't matter which compiler was used; the conclusions are the same.
I rewrote the generated assembly in C++. It roughly looks like this if x is empty:
#include <cstdio>
using namespace std;

struct x { };

void make_x() {
    x tmp;
    printf("original x address %p\n", &tmp);
}

int main() {
    x x1;
    make_x();
    printf("copy of x address %p\n", &x1);
}
If x is big (the int dummy[1024]; member uncommented):
#include <cstdio>
using namespace std;

struct x { int dummy[1024]; };

void make_x(x* x1) {
    printf("original x address %p\n", x1);
}

int main() {
    x x1;
    make_x(&x1);
    printf("copy of x address %p\n", &x1);
}
It turns out that if the object is empty, make_x() only has to print some valid, unique address, and it is free to pick one pointing into its own stack frame. There is also nothing to copy and nothing to return from make_x().
If you make the object bigger (by adding the int dummy[1024]; member, for example), it gets constructed in place, so RVO does kick in, and only the object's address is passed to make_x() to be printed. No object gets copied, nothing gets moved.
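One way to check the "constructed in place" claim without reading assembly is to instrument the type. The following is only a sketch: the type big, its born_at member, the counters, and the function names are made up for illustration and are not part of the code above. It assumes GCC or Clang (because of the noinline attribute) and default settings, i.e. no -fno-elide-constructors.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical instrumented type: it remembers the address it was first
// constructed at and counts every copy and move.
struct big {
    int dummy[1024];
    void* born_at;          // address seen by the default constructor
    static int copies;
    static int moves;

    big() : dummy{}, born_at(this) {}
    big(const big& o) : born_at(o.born_at) {
        std::memcpy(dummy, o.dummy, sizeof dummy);
        ++copies;
    }
    big(big&& o) noexcept : born_at(o.born_at) {
        std::memcpy(dummy, o.dummy, sizeof dummy);
        ++moves;
    }
};
int big::copies = 0;
int big::moves = 0;

// Same trick as above: keep inlining out of the picture.
__attribute__((noinline)) big make_big() {
    return big();
}

// Returns true if the object was built directly in the caller's storage,
// i.e. same address and no copy or move constructor call.
bool constructed_in_place() {
    auto b = make_big();
    std::printf("constructed at %p, lives at %p, copies=%d, moves=%d\n",
                b.born_at, static_cast<void*>(&b), big::copies, big::moves);
    const bool in_place =
        b.born_at == &b && big::copies == 0 && big::moves == 0;
#if __cplusplus >= 201703L
    // Since C++17 this particular elision is mandatory for such a type.
    assert(in_place);
#endif
    return in_place;
}
```

Calling constructed_in_place() from main() should print the same address twice and zero copies and moves. Because big has user-provided copy and move constructors, the compiler is not allowed to pass it around in registers, so the address comparison is meaningful here.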
If the object is empty, the compiler can decide not to pass an address to make_x() (what a waste of resources that would be :) ) and instead let make_x() make up a unique, valid address from its own stack. When this optimization happens is somewhat fuzzy and hard to reason about (that is what you see with y), but it really doesn't matter.
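The "unique, valid address" requirement for empty objects can be illustrated on its own. This is a minimal sketch with made-up names (empty, show_empty_addresses), not something taken from the generated assembly: an empty class still has nonzero size, and two separate objects of it must have different addresses, which is all the printed value has to satisfy.

```cpp
#include <cassert>
#include <cstdio>

struct empty {};   // same shape as x with the dummy member commented out

// Even an empty class occupies at least one byte, so that distinct objects
// can have distinct addresses.
static_assert(sizeof(empty) >= 1, "an empty class still has nonzero size");

// Prints the addresses of two separate empty objects and checks that they
// differ; there is no data in them that could be copied or returned.
void show_empty_addresses() {
    empty a;
    empty b;
    std::printf("a at %p, b at %p\n",
                static_cast<void*>(&a), static_cast<void*>(&b));
    assert(&a != &b);
}
```

Since there is no state to carry over, it makes no observable difference to the data whether the caller hands in storage or the callee uses a slot in its own frame; only the printed address changes.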
RVO appears to happen consistently in the cases where it matters. And, as my earlier confusion shows, even the whole make_x() function can get inlined, so there is no return value to be optimized away in the first place.