Vector Class Discussion

 
thread AVX2 v AVX performance issue - Duong Tran - 2016-05-05
last reply AVX2 v AVX performance issue - Agner - 2016-05-11
 
AVX2 v AVX performance issue
Author: Duong Tran Date: 2016-05-05 18:05
Hello,

I've discovered that the following vectorclass-based code, which transposes a 4x16 matrix of unsigned short integers, performs poorly when compiled for the AVX2 instruction set compared to AVX on the same CPU. For example, on an Intel Core i5-4460 (Haswell) CPU (the same machine in both cases), the code compiled with gcc -march=native or -march=corei7-avx2 is three times slower than the same code compiled with -march=corei7-avx.

Something must be wrong, but I don't know where. Is it a microarchitecture, compiler, or library (vectorclass) issue?


#include "vectorclass.h"

typedef Vec16us X16T;

inline void transpi (X16T y[4], X16T const x[4])
{ // transpose a 4x16 matrix x into a 16x4 matrix y
    X16T x0,x1,x2,x3,y0,y1,y2,y3;
    x0.load( &x[0] );
    x1.load( &x[1] );
    x2.load( &x[2] );
    x3.load( &x[3] );
    // blend indexes 0-15 select from the first argument, 16-31 from the second
    // pass 1: interleave rows 0 and 2, and rows 1 and 3
    y0 = blend16us< 0,16, 1,17, 2,18, 3,19, 4,20, 5,21, 6,22, 7,23>(x0,x2);
    y1 = blend16us< 8,24, 9,25,10,26,11,27,12,28,13,29,14,30,15,31>(x0,x2);
    y2 = blend16us< 0,16, 1,17, 2,18, 3,19, 4,20, 5,21, 6,22, 7,23>(x1,x3);
    y3 = blend16us< 8,24, 9,25,10,26,11,27,12,28,13,29,14,30,15,31>(x1,x3);
    // pass 2: interleave the pass 1 results, giving four columns per vector
    x0 = blend16us< 0,16, 1,17, 2,18, 3,19, 4,20, 5,21, 6,22, 7,23>(y0,y2);
    x1 = blend16us< 8,24, 9,25,10,26,11,27,12,28,13,29,14,30,15,31>(y0,y2);
    x2 = blend16us< 0,16, 1,17, 2,18, 3,19, 4,20, 5,21, 6,22, 7,23>(y1,y3);
    x3 = blend16us< 8,24, 9,25,10,26,11,27,12,28,13,29,14,30,15,31>(y1,y3);
    x0.store( &y[0] );
    x1.store( &y[1] );
    x2.store( &y[2] );
    x3.store( &y[3] );
}

   
AVX2 v AVX performance issue
Author: Agner Date: 2016-05-11 03:14
Duong Tran wrote:
Hello,

I've discovered that the following vectorclass-based code, which transposes a 4x16 matrix of unsigned short integers, performs poorly when compiled for the AVX2 instruction set compared to AVX on the same CPU.

The AVX instruction set does not support 256-bit integer vectors, so the vector class library splits each 256-bit vector into two 128-bit vectors. In that form, your blend pattern happens to map directly onto the punpcklwd and punpckhwd instructions.
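To illustrate why the pattern fits so well on 128-bit halves, here is a minimal sketch using plain SSE2 intrinsics rather than the vector class library. The function name interleave_half_sse2 is made up for this example and is not part of vectorclass. It assumes an x86 target with SSE2 and takes one 8-element half from each of the two source vectors.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper, not part of vectorclass.
// Interleaves one 128-bit half (8 x uint16_t) of a with the matching half of b.
// out receives 16 values: a0 b0 a1 b1 ... a7 b7.
void interleave_half_sse2(const uint16_t* a_half, const uint16_t* b_half,
                          uint16_t* out)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a_half));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b_half));
    __m128i lo = _mm_unpacklo_epi16(a, b);  // punpcklwd: a0 b0 a1 b1 a2 b2 a3 b3
    __m128i hi = _mm_unpackhi_epi16(a, b);  // punpckhwd: a4 b4 a5 b5 a6 b6 a7 b7
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), lo);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 8), hi);
}
```

Called on the low halves of two 16-element vectors, this gives the same element order as the blend pattern 0,16,1,17,...,7,23. Called on the high halves (pointers advanced by 8), it gives the order of the pattern 8,24,9,25,...,15,31. Each 16-element result costs only two unpack instructions.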

With the AVX2 instruction set, the library uses 256-bit integer vectors and general shuffle instructions, which happen to be less efficient in your case. The blend16s and permute16s functions in the file vectori256.h have plenty of special cases, but unfortunately not one for the unpack instructions.
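A possible workaround is to bypass the blend function for this pattern and use the unpack intrinsics directly. The following is only a sketch under some assumptions: the function name interleave16_avx2 is made up for this example, the target attribute is specific to GCC and Clang, and I have not measured whether it is faster on any given CPU. The 256-bit unpack instructions work within each 128-bit lane, so a lane permutation is needed afterwards to put the results in the order of the two blend patterns.

```cpp
#include <immintrin.h>  // AVX2 intrinsics
#include <cstdint>

// Hypothetical helper, not part of vectorclass.
// a and b each point to 16 x uint16_t.
// out_lo receives a0 b0 a1 b1 ... a7 b7     (blend pattern 0,16,1,17,...,7,23)
// out_hi receives a8 b8 a9 b9 ... a15 b15   (blend pattern 8,24,9,25,...,15,31)
__attribute__((target("avx2")))  // GCC/Clang only; alternatively compile with -mavx2
void interleave16_avx2(const uint16_t* a, const uint16_t* b,
                       uint16_t* out_lo, uint16_t* out_hi)
{
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    // The unpack instructions operate on each 128-bit lane separately:
    __m256i lo = _mm256_unpacklo_epi16(va, vb);  // a0 b0..a3 b3 | a8 b8..a11 b11
    __m256i hi = _mm256_unpackhi_epi16(va, vb);  // a4 b4..a7 b7 | a12 b12..a15 b15
    // Recombine lanes: 0x20 = lower lanes of lo and hi, 0x31 = upper lanes
    __m256i r0 = _mm256_permute2x128_si256(lo, hi, 0x20);
    __m256i r1 = _mm256_permute2x128_si256(lo, hi, 0x31);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out_lo), r0);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out_hi), r1);
}
```

One call produces both blend results for a pair of input vectors, using two unpacks and two lane permutations. The caller must make sure the CPU supports AVX2 before calling it, for example with __builtin_cpu_supports("avx2") on GCC or Clang.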