Duong Tran wrote:
Hello, I've discovered that the following
vectorclass-based code, which transposes a 4x16 matrix
of unsigned short integers, performs poorly if
compiled for AVX2 instruction set compared to AVX for
the same CPU.

The AVX instruction set does not support 256-bit integer vectors, so the vector class library splits each 256-bit vector into two 128-bit vectors, where your blend pattern happens to fit the punpckhwd and punpcklwd instructions perfectly. The AVX2 instruction set uses 256-bit integer vectors and general shuffle instructions, which happen to be less efficient in your case. The blend16s and permute16s functions in the file vectori256.h have plenty of special cases, but unfortunately no special case for the unpack instructions.
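
To illustrate why the unpack instructions fit a transpose so well, here is a minimal sketch using plain SSE2 intrinsics on one 128-bit half, i.e. a 4x8 block of unsigned 16-bit integers. This is not the code from the question and not what the vector class library generates internally; the function name transpose_4x8_u16 and the row-major memory layout are assumptions made for the example.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: transposes a row-major 4x8 matrix of uint16_t
// into a row-major 8x4 matrix, so that out[c*4 + r] == in[r*8 + c].
void transpose_4x8_u16(const uint16_t* in, uint16_t* out) {
    // Load the four rows (8 elements each), unaligned loads for safety.
    __m128i r0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + 0));
    __m128i r1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + 8));
    __m128i r2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + 16));
    __m128i r3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + 24));

    // punpcklwd / punpckhwd: interleave 16-bit elements of two rows.
    __m128i t0 = _mm_unpacklo_epi16(r0, r1);  // r0[0] r1[0] r0[1] r1[1] ... r0[3] r1[3]
    __m128i t1 = _mm_unpackhi_epi16(r0, r1);  // r0[4] r1[4] ... r0[7] r1[7]
    __m128i t2 = _mm_unpacklo_epi16(r2, r3);  // r2[0] r3[0] ... r2[3] r3[3]
    __m128i t3 = _mm_unpackhi_epi16(r2, r3);  // r2[4] r3[4] ... r2[7] r3[7]

    // punpckldq / punpckhdq: interleave the 32-bit pairs to complete each column.
    __m128i o0 = _mm_unpacklo_epi32(t0, t2);  // columns 0 and 1
    __m128i o1 = _mm_unpackhi_epi32(t0, t2);  // columns 2 and 3
    __m128i o2 = _mm_unpacklo_epi32(t1, t3);  // columns 4 and 5
    __m128i o3 = _mm_unpackhi_epi32(t1, t3);  // columns 6 and 7

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 0), o0);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 8), o1);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 16), o2);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 24), o3);
}
```

The whole block needs only eight unpack instructions and no general shuffle, which is roughly the situation the 128-bit code path ends up in when the blend pattern matches punpcklwd and punpckhwd.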