I'm reading in pixel data where each pixel is an integer in RGBA format. I first unpack the four bytes to four ints and then convert them to floats. One way I could do this is to use extend_low/extend_high, but I would have to do this four times to get four integers. Instead, I think it's more efficient to use the _mm_cvtepu8_epi32 intrinsic, which unpacks four bytes directly to four ints. Is there a reason this intrinsic is not used by the vectorclass? Here is the code I use now, which unpacks four pixels into 12 floats.
#include "vectorclass.h" // vector class library; _mm_cvtepu8_epi32 itself needs SSE4.1

void int4_to_float12(int *x, float *y, const int offset) {
    // load 4 pixels, convert them from AoS to SoA, expand them to 12 floats
    Vec16uc c16 = Vec16uc().load(x);
    // RGBARGBARGBARGBA -> RRRRGGGGBBBBAAAA
    Vec4ui i4 = (Vec4ui)permute16uc<
        0, 4, 8, 12,
        1, 5, 9, 13,
        2, 6, 10, 14,
        3, 7, 11, 15>(c16);
    // move one channel's four bytes into element 0, then zero-extend them to four ints
    Vec4ui row0 = _mm_cvtepu8_epi32(permute4ui<0,-1,-1,-1>(i4)); // RRRR
    Vec4ui row1 = _mm_cvtepu8_epi32(permute4ui<1,-1,-1,-1>(i4)); // GGGG
    Vec4ui row2 = _mm_cvtepu8_epi32(permute4ui<2,-1,-1,-1>(i4)); // BBBB
    //Vec4ui row3 = _mm_cvtepu8_epi32(permute4ui<3,-1,-1,-1>(i4)); // AAAA
    // store_a requires 16-byte-aligned destinations
    to_float(row0).store_a(&y[0*offset]);
    to_float(row1).store_a(&y[1*offset]);
    to_float(row2).store_a(&y[2*offset]);
    //to_float(row3).store_a(&y[3*offset]);
}
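For checking the vectorized function above, a plain scalar version of the same AoS-to-SoA conversion may be useful. This is only a minimal sketch: the name int4_to_float12_scalar is hypothetical (it is not part of the vector class library), it assumes a little-endian machine so that R is the lowest byte of each int, and it assumes offset is at least 4 so the three output rows do not overlap.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference, not part of the vector class library.
// Assumes little-endian byte order: R is the lowest byte of each 32-bit pixel.
void int4_to_float12_scalar(const int *x, float *y, const int offset) {
    assert(offset >= 4); // assumption: rows of 4 floats must not overlap
    for (int p = 0; p < 4; ++p) {          // four pixels
        const std::uint32_t pixel = static_cast<std::uint32_t>(x[p]);
        for (int c = 0; c < 3; ++c) {      // R, G, B (alpha is skipped)
            const std::uint32_t channel = (pixel >> (8 * c)) & 0xFFu;
            y[c * offset + p] = static_cast<float>(channel);
        }
    }
}
```

Running both versions on the same four pixels and comparing the 12 floats should show whether the permute indices and the byte widening are doing what I expect.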