I'm running the following code in both 32-bit and 64-bit mode. The outer loop (1k iterations) does one operation using YMM registers, followed by VZEROUPPER. The inner loop (1M iterations) then runs only SSE2 instructions:
Code:
/* Test loop with VZEROUPPER & YMM. */
asm volatile (
    ".align 64\n"
    "pcmpeqd %%xmm0, %%xmm0\n"      /* set xmm0 to all ones */
    :
    :
    : "memory"
);
j = 0; i = 0;
start_cycles = rdtsc();
asm volatile (
    /* Outer loop: one YMM operation, then VZEROUPPER. */
    ".align 64\n1:\n"
    "vmovdqa %%ymm0, %%ymm1\n"
    "vzeroupper\n"
    "xor %0, %0\n"                  /* reset inner counter j */
    /* Inner loop: non-VEX SSE2 instructions only. */
    ".align 64\n2:\n"
    "movdqa %%xmm0, %%xmm1\n"
    "paddd %%xmm1, %%xmm0\n"
    "lea 1(%0),%0\n"
    "cmp %2,%0\n"
    "jb 2b\n"
    "lea 1(%1),%1\n"
    "cmp %3,%1\n"
    "jb 1b\n"
    : "+r" (j), "+r" (i)
    : "r" (NUM_INNER_ITER), "r" (NUM_ITER)
    : "memory", "cc"
);
end_cycles = rdtsc();
printf("%6s-test3: %ld*%ld iterations, %lld cycles\n", arch,
       (long)i, (long)j, (long long int)(end_cycles - start_cycles));
Removing `vmovdqa %%ymm0, %%ymm1` from the outer loop makes the problem go away. Also, if I use VEX-encoded instructions in the inner loop, no slow-down is seen. Switching VZEROUPPER to VZEROALL does not appear to make a difference; in fact, replacing `vmovdqa %%ymm0, %%ymm1` and `vzeroupper` with just `vzeroall` makes the same problem appear.
The questions I now have are:
1. Has anyone else seen the same?
2. Am I doing something wrong?
3. How can I avoid this slow-down? I have 32-bit x86 code that mixes VEX and non-VEX portions, with proper VZEROUPPER/VZEROALL after YMM usage, but the non-VEX parts now run slower than expected on Zen4. I'd like to find a way to avoid that (going 64-bit is the obvious one, but this is an open-source library and 32-bit builds are still a thing).
I've attached the C source code that I've been using to test this. It compiles with GCC:
Code:
$ gcc -m64 -O2 -Wall zen4_ymm_32bitmode.c -o zen4_ymm_32bitmode
$ ./zen4_ymm_32bitmode
x86-64-test1: 1000*1000000 iterations, 1020975689 cycles
x86-64-test2: 1000*1000000 iterations, 1017973047 cycles
x86-64-test3: 1000*1000000 iterations, 1006376455 cycles
x86-64-test4: 1000*1000000 iterations, 1009606154 cycles
x86-64-test5: 1000*1000000 iterations, 1002171365 cycles
x86-64-test6: 1000*1000000 iterations, 1002110782 cycles
$ gcc -m32 -O2 -Wall zen4_ymm_32bitmode.c -o zen4_ymm_32bitmode
$ ./zen4_ymm_32bitmode
i386-test1: 1000*1000000 iterations, 1003244375 cycles
i386-test2: 1000*1000000 iterations, 1002098374 cycles
i386-test3: 1000*1000000 iterations, 2003160445 cycles
i386-test4: 1000*1000000 iterations, 2003677727 cycles
i386-test5: 1000*1000000 iterations, 2003784511 cycles
i386-test6: 1000*1000000 iterations, 1002279888 cycles