Vector Class Discussion

 
thread Performance of decimation v blending - Duong Tran - 2016-05-03
last reply Performance of decimation v blending - Agner - 2016-05-03
 
Performance of decimation v blending
Author:  Date: 2016-05-03 09:38
Hello,

I've been fighting the following problem for weeks and about to give up. But I still hope that I miss something.

The problem originates from a fundamental signal processing task: transpose a 4x8 matrix of 16-bit unsigned short integers and its inverse (i.e. transpose a 4x8 matrix).


typedef Vec8us W8T;

void transpo (W8T y[4], W8T const x[4])
{ // Transpose the 8*4 matrix x and store result into 4*8 matrix y
W8T x0,x1,x2,x3,y0,y1,y2,y3;
x0.load( &x[0] );
x1.load( &x[1] );
x2.load( &x[2] );
x3.load( &x[3] );
y0 = blend8us<0,2,4,6,8,10,12,14>(x0,x1);
y1 = blend8us<1,3,5,7,9,11,13,15>(x0,x1);
y2 = blend8us<0,2,4,6,8,10,12,14>(x2,x3);
y3 = blend8us<1,3,5,7,9,11,13,15>(x2,x3);
x0 = blend8us<0,2,4,6,8,10,12,14>(y0,y2);
x2 = blend8us<1,3,5,7,9,11,13,15>(y0,y2);
x1 = blend8us<0,2,4,6,8,10,12,14>(y1,y3);
x3 = blend8us<1,3,5,7,9,11,13,15>(y1,y3);
x0.stor( &y[0] );
x1.stor( &y[1] );
x2.stor( &y[2] );
x3.stor( &y[3] );
}

void transpi (W8T y[4], W8T const x[4])
{ // Transpose the 4*8 matrix x and store result into 8*4 matrix y
W8T x0,x1,x2,x3,y0,y1,y2,y3;
x0.load( &x[0] );
x1.load( &x[1] );
x2.load( &x[2] );
x3.load( &x[3] );
y0 = blend8us<0, 8,1, 9,2,10,3,11>(x0,x2);
y2 = blend8us<4,12,5,13,6,14,7,15>(x0,x2);
y1 = blend8us<0, 8,1, 9,2,10,3,11>(x1,x3);
y3 = blend8us<4,12,5,13,6,14,7,15>(x1,x3);
x0 = blend8us<0, 8,1, 9,2,10,3,11>(y0,y1);
x1 = blend8us<4,12,5,13,6,14,7,15>(y0,y1);
x2 = blend8us<0, 8,1, 9,2,10,3,11>(y2,y3);
x3 = blend8us<4,12,5,13,6,14,7,15>(y2,y3);
x0.stor( &y[0] );
x1.stor( &y[1] );
x2.stor( &y[2] );
x3.stor( &y[3] );
}

Nice, but... benchmark shows that transpo() is substantially slower than transpi(), at least for SSSE3 instruction set. For example, when compiled under gcc -march=core2 and run on a Core 2 Duo E4500 processor, it is more than twice slower.

After many trials with this and about half dozen more versions of transpo(), I'm about to conclude that (within vectorclass library) one of the two fundamental operations, the _decimation_


y0 = blend8us<0,2,4,6,8,10,12,14>(x0,x1);
y1 = blend8us<1,3,5,7,9,11,13,15>(x0,x1);

is substantially slower than its countperpart, the _blending_


y0 = blend8us<0, 8,1, 9,2,10,3,11>(x0,x1);
y1 = blend8us<4,12,5,13,6,14,7,15>(x0,x1);

A temporary workaround for decimation is via blending:


void transpo (W8T y[4], W8T const x[4])
{ // Transpose the 8*4 matrix x and store result into 4*8 matrix y
W8T x0,x1,x2,x3,y0,y1,y2,y3;
x0.load( &x[0] );
x1.load( &x[1] );
x2.load( &x[2] );
x3.load( &x[3] );
y0 = blend8us<0, 8,1, 9,2,10,3,11>(x0,x2);
y2 = blend8us<4,12,5,13,6,14,7,15>(x0,x2);
y1 = blend8us<0, 8,1, 9,2,10,3,11>(x1,x3);
y3 = blend8us<4,12,5,13,6,14,7,15>(x1,x3);
x0 = blend8us<0, 8,1, 9,2,10,3,11>(y0,y1);
x1 = blend8us<4,12,5,13,6,14,7,15>(y0,y1);
x2 = blend8us<0, 8,1, 9,2,10,3,11>(y2,y3);
x3 = blend8us<4,12,5,13,6,14,7,15>(y2,y3);
y0 = blend8us<0, 8,1, 9,2,10,3,11>(x0,x2);
y1 = blend8us<4,12,5,13,6,14,7,15>(x0,x2);
y2 = blend8us<0, 8,1, 9,2,10,3,11>(x1,x3);
y3 = blend8us<4,12,5,13,6,14,7,15>(x1,x3);
y0.stor( &y[0] );
y1.stor( &y[1] );
y2.stor( &y[2] );
y3.stor( &y[3] );
}

which (of course) is only 1.5 times slower than blending.

But I don't think it is the best vectorclass can do. I'm right in that?

   
Performance of decimation v blending
Author: Agner Date: 2016-05-03 12:47
Try to look at the assembly output. The permute and blend functions work differently for different patterns, depending on whether there is a suitable machine instruction that fits the pattern. Try also with higher instruction sets.