It seems to be a recurring problem that many Intel processors (up until Skylake, unless I'm wrong) exhibit poor performance when 256-bit AVX instructions are mixed with legacy (non-VEX) SSE instructions.
According to Intel's documentation, the penalty arises because legacy SSE instructions are defined to preserve the upper 128 bits of the YMM registers. To save power by leaving the upper half of the 256-bit datapath idle, the CPU stores those upper bits away when executing SSE code and reloads them when it returns to AVX code, and those internal saves and restores are expensive.
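For concreteness, here is a rough sketch of the pattern that triggers the transition penalty; the function names and the idea of a separately compiled SSE helper are my own illustration, not anything from Intel's documentation:

```c
#include <immintrin.h>

/* Hypothetical helper standing in for legacy SSE code. In a real build it
 * would live in a library or translation unit compiled WITHOUT -mavx, so the
 * compiler emits non-VEX instructions (addps, movaps, ...) that must preserve
 * YMM[255:128]. It is defined here only so the sketch compiles. */
static float sse_sum4(const float *p)
{
    __m128 v = _mm_loadu_ps(p);
    float t[4];
    _mm_storeu_ps(t, v);
    return t[0] + t[1] + t[2] + t[3];
}

float mixed(const float *a, const float *b, const float *tail)
{
    /* 256-bit AVX work: the upper halves of the YMM registers are now "dirty". */
    __m256 v = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
    float partial[8];
    _mm256_storeu_ps(partial, v);

    /* Without this, calling legacy SSE code while the upper halves are dirty
     * forces the CPU to save them (and restore them on the next 256-bit AVX
     * instruction), which is where the transition penalty comes from. */
    _mm256_zeroupper();

    return partial[0] + sse_sum4(tail);
}
```

As far as I know, recent compilers insert `vzeroupper` automatically around such boundaries when AVX is enabled, so the penalty mostly bites hand-written assembly and code that links against separately compiled SSE binaries.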
However, I can find no obvious reason or explanation for why SSE instructions needed to preserve those upper 128 bits. The corresponding 128-bit VEX instructions (whose use avoids the performance penalty) instead always clear the upper 128 bits of the YMM registers. It seems to me that when Intel defined the AVX architecture, including the widening of the XMM registers into YMM registers, they could simply have defined that the SSE instructions, too, clear the upper 128 bits. Obviously, since the YMM registers were new, no legacy code could have depended on SSE instructions preserving those bits, and it also appears to me that Intel could easily have seen this coming.
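To make the contrast concrete, here is a minimal sketch; the mnemonics in the comments are what GCC and Clang typically emit for this function under the stated flags, which is my assumption rather than something guaranteed. The same 128-bit intrinsic gets either the legacy or the VEX encoding depending on whether AVX is enabled, and only the VEX form zeroes the upper bits:

```c
#include <immintrin.h>

/* Compiled without -mavx, this typically becomes the legacy encoding
 *     addps xmm0, xmm1
 * which leaves YMM0[255:128] untouched (the preservation the question is about).
 *
 * Compiled with -mavx, it typically becomes the VEX encoding
 *     vaddps xmm0, xmm0, xmm1
 * which zeroes YMM0[255:128], so no save/restore of the upper half is needed. */
__m128 add4(__m128 x, __m128 y)
{
    return _mm_add_ps(x, y);
}
```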
So, why did Intel define the SSE instructions to preserve the upper 128 bits of the YMM registers? Is that behaviour ever useful?