Hi, I think the VC++ optimiser may have a problem with permute8 when it takes the _mm256_permutevar8x32_epi32 path:
Vec8i permute8(Vec8i const a);
I was converting some AVX2 intrinsic code to use VCL and noticed a performance loss, so I investigated the assembly output (VS 2017, release build, x64). The code generated for permute8 on the _mm256_permutevar8x32_epi32 path is not optimal because of the way it creates the permutation mask (permmask). This is my calling code:
const Vec32uc src = ...;
pair1_i = permute8<0, 4, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
pair2_i = permute8<1, 5, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
pair3_i = permute8<2, 6, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
pair4_i = permute8<3, 7, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
The constant8ui<...>() call that permute8 uses to create the permmask causes the VC++ optimiser to produce something like this:
mov DWORD PTR u$765[rbp], r8d
mov QWORD PTR u$765[rbp+4], 4
mov QWORD PTR u$765[rbp+12], 0
mov QWORD PTR u$765[rbp+20], 0
mov DWORD PTR u$765[rbp+28], r8d
vmovdqu ymm0, YMMWORD PTR u$765[rbp]
vpermd ymm1, ymm0, ymm2
So when I call permute8 four times in a row, each call gets its own group of movs before its vpermd. I commented out the constant8ui call inside permute8 and replaced it with this:
const __m256i permmask = _mm256_set_epi32(i7 & 7, i6 & 7, i5 & 7, i4 & 7, i3 & 7, i2 & 7, i1 & 7, i0 & 7);
That produces assembly like this:
vmovdqu ymm0, YMMWORD PTR __ymm@0000000000000000000000000000000000000000000000000000000400000000
vpermd ymm0, ymm0, ymm3
And I get much better performance, equal to my original intrinsic code. So it might be worth guarding the permmask construction with an ifdef for _MSC_VER, along the lines of the sketch below.
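This is only a rough sketch of what I mean, not the actual library source; the variable names and the exact shape of the existing constant8ui call are my assumptions about how permute8 builds the mask:

#if defined(_MSC_VER) && !defined(__clang__)
    // MSVC spills the constant8ui mask to the stack with a series of movs,
    // so build it with _mm256_set_epi32 instead (note the reversed argument
    // order: _mm256_set_epi32 takes the highest element first).
    const __m256i permmask = _mm256_set_epi32(
        i7 & 7, i6 & 7, i5 & 7, i4 & 7, i3 & 7, i2 & 7, i1 & 7, i0 & 7);
#else
    const __m256i permmask = constant8ui<i0 & 7, i1 & 7, i2 & 7, i3 & 7,
                                         i4 & 7, i5 & 7, i6 & 7, i7 & 7>();
#endif
    return _mm256_permutevar8x32_epi32(a, permmask);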
Thanks,
Neil