Vector Class Discussion

AVX2 v AVX performance issue

Author: Agner

Date: 2016-05-11 03:14

Duong Tran wrote:

Hello,
I've discovered that the following vectorclass-based code, which transposes a 4x16 matrix of unsigned short integers, performs worse when compiled for the AVX2 instruction set than when compiled for AVX on the same CPU.

The AVX instruction set does not support 256-bit integer vectors, so the vector class library splits each 256-bit vector into two 128-bit vectors. On those 128-bit halves, your blend pattern happens to map exactly onto the punpckhwd and punpcklwd instructions.
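To illustrate what these two instructions compute, here is a minimal scalar sketch in portable C++. It models one 128-bit register as eight 16-bit words. The names Words8, unpack_lo, unpack_hi and unpack_demo are made up for this illustration and are not part of the vector class library.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of one 128-bit register holding eight 16-bit words.
using Words8 = std::array<std::uint16_t, 8>;

// Scalar model of punpcklwd: interleave the low four words of a and b.
Words8 unpack_lo(const Words8& a, const Words8& b) {
    Words8 r{};
    for (std::size_t i = 0; i < 4; ++i) {
        r[2 * i]     = a[i];
        r[2 * i + 1] = b[i];
    }
    return r;
}

// Scalar model of punpckhwd: interleave the high four words of a and b.
Words8 unpack_hi(const Words8& a, const Words8& b) {
    Words8 r{};
    for (std::size_t i = 0; i < 4; ++i) {
        r[2 * i]     = a[4 + i];
        r[2 * i + 1] = b[4 + i];
    }
    return r;
}

// Small self-check with made-up input values.
void unpack_demo() {
    const Words8 a{0, 1, 2, 3, 4, 5, 6, 7};
    const Words8 b{10, 11, 12, 13, 14, 15, 16, 17};
    assert((unpack_lo(a, b) == Words8{0, 10, 1, 11, 2, 12, 3, 13}));
    assert((unpack_hi(a, b) == Words8{4, 14, 5, 15, 6, 16, 7, 17}));
}
```

A blend whose index pattern alternates between two sources in this way can be done with a single instruction per 128-bit half, which is why the AVX code path comes out well here.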

With the AVX2 instruction set, the library uses 256-bit integer vectors and general shuffle instructions, which happen to be less efficient in your case. The blend16s and permute16s functions in the file vectori256.h have plenty of special cases, but unfortunately no special case for the unpack instructions.
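For reference, AVX2 does provide 256-bit forms of the unpack instructions (vpunpcklwd and vpunpckhwd, available as the intrinsics _mm256_unpacklo_epi16 and _mm256_unpackhi_epi16). They interleave within each 128-bit lane separately rather than across the whole register. The following portable scalar sketch models that behaviour. The names Words16, unpack_lo_256, unpack_hi_256 and unpack256_demo are invented for this illustration, and this is not code from vectori256.h.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of one 256-bit register holding sixteen 16-bit words.
using Words16 = std::array<std::uint16_t, 16>;

// Scalar model of the 256-bit vpunpcklwd: in each 128-bit lane (8 words),
// interleave the low four words of a and b.
Words16 unpack_lo_256(const Words16& a, const Words16& b) {
    Words16 r{};
    for (std::size_t lane = 0; lane < 2; ++lane) {
        const std::size_t base = lane * 8;
        for (std::size_t i = 0; i < 4; ++i) {
            r[base + 2 * i]     = a[base + i];
            r[base + 2 * i + 1] = b[base + i];
        }
    }
    return r;
}

// Scalar model of the 256-bit vpunpckhwd: in each 128-bit lane,
// interleave the high four words of a and b.
Words16 unpack_hi_256(const Words16& a, const Words16& b) {
    Words16 r{};
    for (std::size_t lane = 0; lane < 2; ++lane) {
        const std::size_t base = lane * 8;
        for (std::size_t i = 0; i < 4; ++i) {
            r[base + 2 * i]     = a[base + 4 + i];
            r[base + 2 * i + 1] = b[base + 4 + i];
        }
    }
    return r;
}

// Small self-check with made-up input values.
void unpack256_demo() {
    Words16 a{}, b{};
    for (std::size_t i = 0; i < 16; ++i) {
        a[i] = static_cast<std::uint16_t>(i);
        b[i] = static_cast<std::uint16_t>(100 + i);
    }
    assert((unpack_lo_256(a, b) ==
            Words16{0, 100, 1, 101, 2, 102, 3, 103,
                    8, 108, 9, 109, 10, 110, 11, 111}));
    assert((unpack_hi_256(a, b) ==
            Words16{4, 104, 5, 105, 6, 106, 7, 107,
                    12, 112, 13, 113, 14, 114, 15, 115}));
}
```

A blend pattern that matches one of these per-lane interleaves is the kind of case a dedicated unpack special case would cover. Without it, the pattern falls through to the general shuffle path.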


AVX2 v AVX performance issue - Duong Tran - 2016-05-05

AVX2 v AVX performance issue - Agner - 2016-05-11