Agner wrote:
The AES instruction doesn't come in vector sizes bigger than 128 bits. That's why I am not using it.
Is that a big issue? The AES instruction has a latency of 7 cycles on Sandy Bridge and later, but a new one can be started every cycle. Therefore, by generating random numbers in 8 blocks, the theoretical peak performance is 0.64 cpB (though in practice I only managed 0.8 cpB per core), which I believe is better than most RNGs out there. I have only seen Intel MKL's SFMT19937 and MT2203 do better (0.5 cpB, the last time I tested them). Anyway, it is indeed a shame that the newer VEX-encoded AES instructions only work on the lower XMM part of the YMM registers.
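
To make the interleaving argument concrete, here is a rough steady-state model of the cost in cycles per byte. It is only a sketch under my own assumptions: 10 AES rounds per 16-byte block (the post does not say how many rounds the generator uses), the quoted latency of 7 cycles and one instruction issued per cycle, and no accounting for key whitening, loads/stores or pipeline fill. The function name and parameters are made up for illustration.

```python
def cycles_per_byte(blocks, rounds=10, latency=7, issue_interval=1, block_bytes=16):
    """Steady-state cost estimate for `blocks` independent AES streams.

    Assumptions (not from the original post): 10 rounds per block and
    16 bytes of output per block. Latency and issue rate are the figures
    quoted above.
    """
    # One round for all blocks takes as long as it takes to issue them,
    # but never less than the latency, because round r+1 of a block has
    # to wait for the result of round r of the same block.
    cycles_per_round = max(blocks * issue_interval, latency)
    total_cycles = rounds * cycles_per_round
    return total_cycles / (blocks * block_bytes)


if __name__ == "__main__":
    for n in (1, 2, 4, 7, 8):
        print(n, "blocks:", cycles_per_byte(n), "cpB")
```

With one block the latency dominates (70 cycles for 16 bytes, 4.375 cpB). With 7 or more blocks in flight the pipeline stays full and the model bottoms out at 0.625 cpB, which is in the same range as the 0.64 cpB quoted above. Going beyond 8 blocks gains nothing in this model, since the issue rate is then the limit.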