I would like to propose a gather() method as an extension to the vectorclass library that would allow to load a vector elements from
arbitrary memory locations specified by an index which is itself a vector register.
I have found such a gather() instruction to be an invaluable tool for 'real-world' applications where the data items are not usually distributed contigously in memory.
Unfortunately the gather() instruction is missing in all instruction sets prior to AVX2 so it would have to be emulated in software something like this: inline __m128 vgather4(const float *base, const __m128i &indices)
{
__m128 res = _mm_load_ss(base+_mm_cvtsi128_si32(indices));
res = _mm_insert_ps(res,_mm_load_ss(base+_mm_extract_epi32(indices,1)),_MM_MK_INSERTPS_NDX(0,1,0));
res = _mm_insert_ps(res,_mm_load_ss(base+_mm_extract_epi32(indices,2)),_MM_MK_INSERTPS_NDX(0,2,0));
res = _mm_insert_ps(res,_mm_load_ss(base+_mm_extract_epi32(indices,3)),_MM_MK_INSERTPS_NDX(0,3,0));
return res;
} inline __m256 vgather8(const float *base, const Vec8i &indices)
{
const __m128 low = vgather4(base,indices.get_low()),
high = vgather4(base,indices.get_high());
return _mm256_insertf128_ps(_mm256_castps128_ps256(low),high,1);
} This looks like a lot of overhead (it compiles to 18 assembly instructions to basically load 8 values from memory so if one does only
one or two computations after loading a vector, there will be little performance benefit.
However there are many usage cases where data can be re-used many times for computations once loaded in a vector register so there is potential for significant speedup with the gather() instruction (for example I was able to get a speedup of a factor of three for a 3D grid tri-cubic interpolation code for fluid simulation advection using the above gather() instruction on top of <vectorclass.h>) I wonder however if there may be a faster method for emulating gather() with pre-AVX2 means ?
(under AVX1 one could use an implementation basd on _mm256_blend_ps/_mm256_broadcast_ss but I am not sure if this would be faster though) Usage example: simultaneously loading 8 pixels at 8 arbitrary coordinate pairs (ix,iy) from a 2D grid: Vec8i ix,iy,grid_width;
float *p_grid_base;
...
ix = min(max(ix,0),grid_width); ... // calculate & clamp coords etc.
Vec8i idx = ix + iy * grid_width; // calc indices
Vec8f pixel_value = vgather8( p_grid_base, idx ); // load 8 pixels in parallel |