Intel Floating Point Executing 3 to 4 Times Faster Than It Should. MAKES NO SENSE
Posted: 2021-10-04, 5:09:12
Hello Agner. I would sure like an explanation (if you have one) for the insanely fast floating point speeds I'm getting on Ivy Bridge and Haswell processors that seem to defy the laws of physics.
Last year I was benchmarking code I was writing to calculate gangs of sine waves for additive synthesis using SSE in assembly language. All I did was change one instruction at the start of the loop from something like:
movss xmm0,[esi]
movss xmm1,xmm0
to:
movss xmm0,[esi]
movss xmm1,[esi]
I wanted to get rid of the dependency. That one change made the entire loop of about a dozen instructions execute about 4 times faster. The only explanation I could think of was that Intel was clever enough to use otherwise idle ALUs in the vector unit to run scalar code. But when I vectorized the code, it ran just as fast.
I then moved some old Pentium x87 floating point code for doing 3D transforms to my Ivy Bridge processor. The loop contains 9 multiplies, 9 adds, moves of data in and out of registers, loop overhead, and a few dependencies. It takes about 35 clock cycles on a Pentium but only about 12 on an Ivy Bridge, an average of 0.66 cycles per floating point instruction. Ivy Bridge multiply latencies are even longer than the Pentium's (5 clocks vs. 3), and still it's faster. I had no idea Intel was working on speeding up the obsolete x87, and yet it's insanely fast.
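I don't have the Pentium loop in front of me to paste, but a hypothetical sketch of the pattern I'm describing, transforming one vertex through a 3x3 rotation plus a translation, would look something like this (NASM-style syntax; m00..m22 and tx/ty/tz are placeholder memory operands, not my actual code):
; hypothetical x87 3D transform of one vertex: v' = M*v + t
; (3 fmul + 3 fadd per component, 9 multiplies and 9 adds in all)
fld dword [esi]          ; x
fmul dword [m00]         ; m00*x
fld dword [esi+4]        ; y
fmul dword [m01]         ; m01*y
faddp st1, st0           ; m00*x + m01*y
fld dword [esi+8]        ; z
fmul dword [m02]         ; m02*z
faddp st1, st0           ; m00*x + m01*y + m02*z
fadd dword [tx]          ; + tx
fstp dword [edi]         ; store x'
; ...the same pattern repeats for y' (m10..m12, ty) and z' (m20..m22, tz)
The three output components are independent dot products, so there is work available to overlap, but with 5-clock multiplies I still wouldn't expect roughly 12 cycles for the whole thing.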
I just spent a day slowly deleting instructions from loops and trying different arrangements to figure out what's going on. I ran into a number of things that didn't make much sense, but one big effect really stands out. Here's an example:
This loop executes in 5 clock cycles per iteration (the registers are filled with 1.0 so the repeated multiply doesn't overflow):
loop:
mulss xmm0,xmm1
dec rcx
jnz loop
This loop executes in 1 clock cycle per iteration:
loop:
movss xmm0,[rsi]
mulss xmm0,xmm1
dec rcx
jnz loop
Pre-loading the destination register of the multiply speeds this loop up by 5 times; loading the source register instead does not speed it up. There's no way that loop should be able to run faster than the 5-clock latency of the multiply instruction, and yet it does. This should be impossible. Even more bizarre is that the movss adds a dependency (the multiply now has to wait for the load) rather than removing one. This effect seems to be one of the main reasons I'm getting the insane speeds I am. I would really like some insight into what might be going on, if you have any idea.
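For anyone who wants to reproduce the numbers, here is a minimal sketch of the kind of timing harness I mean (hypothetical, 64-bit NASM-style code, not my exact test program; it assumes rsi points at readable data and the caches are warm):
cpuid                    ; serialize before reading the time-stamp counter
rdtsc                    ; start count in edx:eax
mov r8d, eax             ; save the low 32 bits of the start count
mov rcx, 100000000       ; iteration count (keeps the delta within 32 bits)
timed:
movss xmm0, [rsi]        ; the loop under test
mulss xmm0, xmm1
dec rcx
jnz timed
rdtsc                    ; stop count
sub eax, r8d             ; elapsed cycles, low 32 bits
; divide eax by the iteration count to get cycles per iteration
With this many iterations the measurement overhead is negligible; the one caveat is that rdtsc counts reference cycles, so turbo has to be off (or the result rescaled) for the figures to line up with core clocks.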
Thanks,
Elhardt