It seems to be a recurring problem that many Intel processors (up until Skylake, unless I'm wrong) exhibit poor performance when 256-bit AVX instructions are mixed with legacy (non-VEX) SSE instructions.
According to Intel's documentation, the penalty arises because legacy SSE instructions are defined to preserve the upper 128 bits of the YMM registers. To save power by leaving the upper half of the 256-bit datapath idle, the CPU stores those upper bits away when executing SSE code and reloads them when it returns to AVX code, and those internal saves and restores are expensive.
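For concreteness, here is a rough sketch of the pattern that triggers the transition penalty; the function names and the idea of a separately compiled SSE helper are my own illustration, not anything from Intel's documentation:

```c
#include <immintrin.h>

/* Hypothetical helper standing in for legacy SSE code. In a real build it
 * would live in a library or translation unit compiled WITHOUT -mavx, so the
 * compiler emits non-VEX instructions (addps, movaps, ...) that must preserve
 * YMM[255:128]. It is defined here only so the sketch compiles. */
static float sse_sum4(const float *p)
{
    __m128 v = _mm_loadu_ps(p);
    float t[4];
    _mm_storeu_ps(t, v);
    return t[0] + t[1] + t[2] + t[3];
}

float mixed(const float *a, const float *b, const float *tail)
{
    /* 256-bit AVX work: the upper halves of the YMM registers are now "dirty". */
    __m256 v = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
    float partial[8];
    _mm256_storeu_ps(partial, v);

    /* Without this, calling legacy SSE code while the upper halves are dirty
     * forces the CPU to save them (and restore them on the next 256-bit AVX
     * instruction), which is where the transition penalty comes from. */
    _mm256_zeroupper();

    return partial[0] + sse_sum4(tail);
}
```

As far as I know, recent compilers insert `vzeroupper` automatically around such boundaries when AVX is enabled, so the penalty mostly bites hand-written assembly and code that links against separately compiled SSE binaries.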
However, I can find no obvious reason or explanation for why SSE instructions needed to preserve those upper 128 bits. The corresponding 128-bit VEX instructions (whose use avoids the performance penalty) instead always clear the upper 128 bits of the YMM registers. It seems to me that when Intel defined the AVX architecture, including the widening of the XMM registers into YMM registers, they could simply have defined that the SSE instructions, too, clear the upper 128 bits. Obviously, since the YMM registers were new, no legacy code could have depended on SSE instructions preserving those bits, and it also appears to me that Intel could easily have seen this coming.
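To make the contrast concrete, here is a minimal sketch; the mnemonics in the comments are what GCC and Clang typically emit for this function under the stated flags, which is my assumption rather than something guaranteed. The same 128-bit intrinsic gets either the legacy or the VEX encoding depending on whether AVX is enabled, and only the VEX form zeroes the upper bits:

```c
#include <immintrin.h>

/* Compiled without -mavx, this typically becomes the legacy encoding
 *     addps xmm0, xmm1
 * which leaves YMM0[255:128] untouched (the preservation the question is about).
 *
 * Compiled with -mavx, it typically becomes the VEX encoding
 *     vaddps xmm0, xmm0, xmm1
 * which zeroes YMM0[255:128], so no save/restore of the upper half is needed. */
__m128 add4(__m128 x, __m128 y)
{
    return _mm_add_ps(x, y);
}
```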
So, why did Intel define the SSE instructions to preserve the upper 128 bits of the YMM registers? Is that behaviour ever useful?