SSE replacement for FPREM1
Posted: 2021-09-19, 22:30:25
I'm working on patching an old piece of code used to reduce the angles of a 3D float vector to the range ±Pi. The original code used loops to implement a horribly inaccurate version of IEEE remainder.
Replacing the loops with FPREM1 against a divisor of 2Pi has been working well so far, but I'd really like to use SSE instructions instead, since FPREM1 is slow and the angles can easily be loaded into an XMM register and processed as packed singles. The optimization guide recommends: "Multiply by the reciprocal divisor, get the fractional part by subtracting the truncated value, and then multiply by the divisor." However, this frequently produces incorrect results when the input angle is a multiple of Pi.
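To make the difference concrete, here is a minimal scalar sketch of one lane of the computation, not the original code. `reduce_trunc` follows the quoted recipe literally; `reduce_nearest` is a hypothetical variant that rounds the quotient to the nearest integer, which is how IEEE remainder chooses its quotient. The function names, the constants, and the magic-number rounding trick are assumptions for illustration. In packed SSE code the truncation would correspond to CVTTPS2DQ followed by CVTDQ2PS, and the nearest rounding to CVTPS2DQ followed by CVTDQ2PS under the default MXCSR rounding mode.

```c
/* Scalar stand-ins for one lane of a packed-single angle reduction.
 * Assumes IEEE single precision, round-to-nearest mode, and no
 * -ffast-math (which could optimize the rounding trick away). */

static const float TWO_PI     = 6.28318530717958647692f;
static const float INV_TWO_PI = 0.15915494309189533577f;

/* The quoted recipe: multiply by the reciprocal, subtract the truncated
 * value, multiply by the divisor. The result keeps the sign of x and lies
 * in (-2Pi, 2Pi), so it behaves like a truncating modulo, not like IEEE
 * remainder. The int cast limits this to |x / 2Pi| < 2^31. */
static float reduce_trunc(float x)
{
    float q    = x * INV_TWO_PI;
    float frac = q - (float)(int)q;   /* drop the integer part, toward zero */
    return frac * TWO_PI;
}

/* Hypothetical variant: round the quotient to the nearest integer (ties to
 * even) and subtract that many divisors from x. Adding and subtracting
 * 1.5 * 2^23 forces the rounding; valid only for |x / 2Pi| < 2^22. */
static float reduce_nearest(float x)
{
    const float MAGIC = 12582912.0f;  /* 1.5 * 2^23 */
    float q = x * INV_TWO_PI;
    float n = (q + MAGIC) - MAGIC;    /* nearest integer to q */
    return x - n * TWO_PI;
}
```

For x = 4.0f, which lies between Pi and 2Pi, `reduce_trunc` returns roughly 4.0 (outside ±Pi), while `reduce_nearest` returns roughly -2.2832, matching what IEEE remainder gives for that input. Even the nearest-rounding variant is not bit-exact with FPREM1: the quotient comes from a rounded multiply by a rounded reciprocal, so inputs close to an odd multiple of Pi sit near a rounding tie and can come out as +Pi where FPREM1 gives -Pi, or the reverse.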
Is there a simple way to make the SSE version behave more like a true IEEE remainder?