We often need to convert data stored as uint8/uint16 (e.g. image or video samples) to int32 before doing calculations in 32-bit floating point. SSE4.1 provides _mm_cvtepu*_epi32 and AVX2 provides _mm256_cvtepu*_epi32 for this. When SSE4.1 or AVX2 is available, these should be faster than the usual extend_low()+extend_high() approach, which has to be applied twice (8-bit to 16-bit, then 16-bit to 32-bit) to get 32-bit integers from 8-bit integers. Could you add conversion functions for direct (u)int8-to-int32 and (u)int16-to-int32 that use these SSE4.1 and AVX2 intrinsics? Best regards.
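To illustrate the idea, here is a minimal sketch using raw intrinsics only. The helper names `u8x4_to_f32_direct` and `u8x4_to_f32_two_step` and the `TARGET_SSE41` macro are made up for this example and are not part of any library; the two-step version uses plain SSE2 unpacks as a stand-in for the extend_low()-style path, and no performance numbers are claimed here.

```cpp
#include <cstdint>
#include <cstring>
#include <immintrin.h>  // SSE2 and SSE4.1 intrinsics

// Assumption: on GCC/Clang this attribute enables SSE4.1 for one function,
// so the file builds without a global -msse4.1 flag. Other compilers: no-op.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

// Hypothetical helper: 4 x uint8 -> 4 x int32 in one instruction, then -> float.
TARGET_SSE41 inline void u8x4_to_f32_direct(const std::uint8_t* src, float* dst) {
    std::int32_t packed;
    std::memcpy(&packed, src, sizeof packed);        // load exactly 4 bytes
    __m128i bytes = _mm_cvtsi32_si128(packed);       // bytes sit in the low 32 bits
    __m128i ints  = _mm_cvtepu8_epi32(bytes);        // SSE4.1: zero-extend 8 -> 32 bits
    _mm_storeu_ps(dst, _mm_cvtepi32_ps(ints));       // int32 -> float
}

// Hypothetical helper: same result with SSE2 only, using two widening steps.
inline void u8x4_to_f32_two_step(const std::uint8_t* src, float* dst) {
    std::int32_t packed;
    std::memcpy(&packed, src, sizeof packed);        // load exactly 4 bytes
    const __m128i zero = _mm_setzero_si128();
    __m128i bytes = _mm_cvtsi32_si128(packed);
    __m128i words = _mm_unpacklo_epi8(bytes, zero);  // uint8  -> uint16
    __m128i ints  = _mm_unpacklo_epi16(words, zero); // uint16 -> uint32
    _mm_storeu_ps(dst, _mm_cvtepi32_ps(ints));       // int32 -> float
}
```

Both helpers should produce identical output for the same four input bytes; the direct version simply replaces the two unpack steps with a single widening instruction. An AVX2 variant would follow the same pattern with _mm256_cvtepu8_epi32 on eight bytes.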