Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


c++ - Constexpr and SSE intrinsics

Most C++ compilers support SIMD (SSE/AVX) instructions through intrinsics like

_mm_cmpeq_epi32

My problem is that this function is not marked constexpr, although "semantically" there is no reason for it not to be, since it is a pure function.

Is there any way I could write my own version of (for example) _mm_cmpeq_epi32 that is constexpr?

Obviously, I would like the function to use the proper asm at runtime; I know I can reimplement any SIMD function as a slow function that is constexpr.

In case you wonder why I care about SIMD functions being constexpr: non-constexpr-ness is contagious, meaning that none of my functions that use those SIMD functions can be constexpr.


1 Reply


It's unfortunate that Intel's intrinsics aren't defined as constexpr.

There's no reason they couldn't be; compilers can and do evaluate them at compile time for constant-propagation and other optimizations. (This is one major reason why builtin functions / intrinsics are better than inline asm wrappers for single instructions.)


Solution for GCC. (Doesn't work for clang or MSVC).

ICC compiles it but chokes when you try to use it as part of an initializer for a constexpr __m128i.

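#include <immintrin.h>

typedef int v4si __attribute__((vector_size(16)));  // four 32-bit ints (GNU C vector)

constexpr
__m128i pcmpeqd(__m128i a, __m128i b) {
    // GNU C vector compare: all-ones in each equal element, zero otherwise
    return (__m128i)((v4si)a == (v4si)b);      // fine with gcc and ICC

    //return (__m128i)__builtin_ia32_pcmpeqd128((v4si)a, (v4si)b); // bad with ICC
    //return _mm_cmpeq_epi32(a,b);  // not constexpr-compatible
}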

See it on the Godbolt compiler explorer, with two test callers (one with variables, one with constexpr __m128i v1 {0x100000000, 0x300000002}; inputs). Interestingly, ICC doesn't do constant-propagation through pcmpeqd or _mm_cmpeq_epi32; it loads two constants and uses an actual pcmpeqd, even with optimization enabled. The same thing happens with/without constexpr. I think it normally optimizes

gcc does accept constexpr __m128i vector_const { pcmpeqd(__m128i{0,0}, __m128i{-1,-1}) };
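The two test callers mentioned above are not reproduced here, so the following is only a sketch of what they might look like, assuming GCC on x86-64. The v4si typedef and the names compare_vars, compare_consts and check_callers are made up for illustration; only pcmpeqd and the v1 initializer come from the answer.

```cpp
#include <immintrin.h>
#include <cassert>

// Assumed helper type: four 32-bit ints in one 16-byte GNU C vector.
typedef int v4si __attribute__((vector_size(16)));

// constexpr compare built from GNU C vector operators (GCC-specific sketch).
constexpr __m128i pcmpeqd(__m128i a, __m128i b) {
    return (__m128i)((v4si)a == (v4si)b);
}

// Caller 1 (hypothetical): runtime inputs, so the compare happens at runtime.
__m128i compare_vars(__m128i a, __m128i b) {
    return pcmpeqd(a, b);
}

// Caller 2 (hypothetical): constant inputs, forced through constant evaluation
// by storing the result in a constexpr variable.
__m128i compare_consts() {
    constexpr __m128i v1 {0x100000000, 0x300000002};  // dwords 0, 1, 2, 3
    constexpr __m128i v2 {0x100000000, 0x300000009};  // dwords 0, 1, 9, 3
    constexpr __m128i eq = pcmpeqd(v1, v2);
    return eq;
}

// Runtime sanity check: only dword 2 differs, so bytes 8..11 of the mask are 0.
void check_callers() {
    assert(_mm_movemask_epi8(compare_consts()) == 0xF0FF);
    assert(_mm_movemask_epi8(compare_vars(_mm_set1_epi32(7),
                                          _mm_set1_epi32(7))) == 0xFFFF);
}
```

Whether a particular compiler actually folds compare_consts down to a single constant load is best confirmed by looking at the generated asm.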


GCC (but not clang) treats __builtin_ia32 functions as constexpr-compatible. The documentation for GNU C x86 built-in functions doesn't mention this, but probably only because it's C documentation, not C++.

GNU C native vector syntax is also constexpr-compatible; that's a second option that's again only viable if you don't care about MSVC.

GNU C defines __m128i as a vector of two long long elements. So for integer SIMD, you need to define other types (or use the types defined by gcc/clang/ICC's immintrin.h).
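As a rough illustration of defining your own element types, here is a sketch assuming GCC on x86-64. The typedefs v4si and v8hi and the helpers add_epi32, add_epi16 and check_lane_adds are hypothetical names, not part of any header. The same __m128i value is reinterpreted with different lane widths through explicit casts, and the native + operator then works per lane.

```cpp
#include <immintrin.h>
#include <cassert>

// Assumed lane-typed views of a 128-bit register (illustrative names).
typedef int   v4si __attribute__((vector_size(16)));  // 4 x 32-bit
typedef short v8hi __attribute__((vector_size(16)));  // 8 x 16-bit

// 32-bit lane add: cast to v4si, add with the native operator, cast back.
constexpr __m128i add_epi32(__m128i a, __m128i b) {
    return (__m128i)((v4si)a + (v4si)b);
}

// 16-bit lane add on the same register type.
constexpr __m128i add_epi16(__m128i a, __m128i b) {
    return (__m128i)((v8hi)a + (v8hi)b);
}

// Each dword of a is 0x0000FFFF, each dword of b is 0x00000001.
constexpr __m128i a = {0x0000FFFF0000FFFF, 0x0000FFFF0000FFFF};
constexpr __m128i b = {0x0000000100000001, 0x0000000100000001};

// Both sums are computed during constant evaluation.
constexpr __m128i sum32 = add_epi32(a, b);  // carry crosses into the upper 16 bits
constexpr __m128i sum16 = add_epi16(a, b);  // low 16-bit lane: -1 + 1 = 0, no carry out

// Runtime check of the low dword of each result.
void check_lane_adds() {
    assert(_mm_cvtsi128_si32(sum32) == 0x10000);
    assert(_mm_cvtsi128_si32(sum16) == 0);
}
```

The inputs are chosen so that no lane overflows as a signed value, since signed overflow is not allowed in a constant expression.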


(The only weird thing is that static const __m128i foo = _mm_set1_epi32(2); doesn't turn into a constant initializer. Instead it is initialized at runtime by copying from .rodata, using a guard variable that is checked on every function call to see whether the variable still needs to be initialized, which is terrible.)
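A possible workaround, sketched under the assumption of GCC on x86-64 with C++14 or later: build the vector with GNU C vector syntax in a constexpr helper and declare the local as constexpr, which requires a constant initializer. The helper set1_epi32 and the functions twos_static, twos_constexpr and check_twos are hypothetical names used only for this illustration.

```cpp
#include <immintrin.h>
#include <cassert>

// Assumed helper type: four 32-bit ints in one 16-byte GNU C vector.
typedef int v4si __attribute__((vector_size(16)));

// Hypothetical constexpr stand-in for _mm_set1_epi32.
constexpr __m128i set1_epi32(int x) {
    v4si v = {x, x, x, x};
    return (__m128i)v;
}

// The form discussed above: a function-local static set up via the intrinsic.
__m128i twos_static() {
    static const __m128i foo = _mm_set1_epi32(2);
    return foo;
}

// Alternative sketch: a constexpr variable must be constant-initialized,
// so no runtime initialization or guard variable is involved.
__m128i twos_constexpr() {
    static constexpr __m128i foo = set1_epi32(2);
    return foo;
}

// Both forms should produce the same value: four dwords equal to 2.
void check_twos() {
    __m128i eq = _mm_cmpeq_epi32(twos_static(), twos_constexpr());
    assert(_mm_movemask_epi8(eq) == 0xFFFF);
}
```

How a given GCC version compiles twos_static is worth checking in the generated asm rather than assuming.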


GCC's xmmintrin.h and emmintrin.h define Intel intrinsics in terms of native vector operators (like *) or __builtin_ia32 functions. It looks like they prefer using operators when possible, instead of (__m128i)__builtin_ia32_pcmpeqd128((v4si)a, (v4si)b);

gcc does require explicit casts between different vector types.

From gcc7.3's emmintrin.h (SSE2):

extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpeq_epi32 (__m128i __A, __m128i __B)
{
  return (__m128i) ((__v4si)__A == (__v4si)__B);
}

#ifdef __OPTIMIZE__
extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_shuffle_epi32 (__m128i __A, const int __mask)
{
  return (__m128i)__builtin_ia32_pshufd ((__v4si)__A, __mask);
}
#else
#define _mm_shuffle_epi32(A, N) \
  ((__m128i)__builtin_ia32_pshufd ((__v4si)(__m128i)(A), (int)(N)))
#endif

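Interesting: gcc's header avoids an inline function in some cases if compiling with optimization disabled. For _mm_shuffle_epi32 the macro is most likely needed because __builtin_ia32_pshufd requires its mask to be a compile-time constant, and without optimization the function parameter __mask isn't constant-propagated into the builtin. I guess it may also lead to better debug symbols, so you don't single-step into the definition of the inline function (which does happen when using stepi in GDB in optimized code with a TUI source window showing).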

