Interesting. So it sounds like the odd rule also exists in the uop cache territory?
Here is another example:
or rax, 1
or rdx, 1
or rsi, 1
movaps xmm0, [r10]
or rdi, 1
or r8, 1
movaps xmm1, [r11]
or r9, 1
This runs at 2 clocks / 8 instructions regardless of uop cache hit or miss. But if all the ORs are changed to ANDs, it slows down to 2.45 clocks / 8 instructions when the code does not fit in the uop cache.
Of course, with the second movaps moved to the end,
and rax, 1
and rdx, 1
and rsi, 1
movaps xmm0, [r10]
and rdi, 1
and r8, 1
and r9, 1
movaps xmm1, [r11]
this runs at 2 clocks / 8 instructions without any problem.
The result means not only that the decode throughput of the AND instruction is limited to 3 per cycle, but also that the 4-1-1-1 pattern rule applies to it. This makes me believe that macro-fusible instructions are handled only by the simple decoders.
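To check whether that hypothesis is at least consistent with the three measurements, here is a rough toy model in Python. It is only a sketch under assumptions that are mine, not documented behaviour: 4 decode slots per cycle, slot 0 is the complex decoder, instructions are decoded strictly in order, and a macro-fusible instruction (AND in this model, but not OR) can never use slot 0, so that slot idles when such an instruction is next. Fetch bandwidth, predecode and everything else are ignored, and the names decode_cycles, FUSIBLE and the loop lists are made up for illustration.

```python
# Toy model (assumption, not documented hardware behaviour):
# 4 decode slots per cycle, slot 0 = complex decoder, slots 1-3 = simple.
# A macro-fusible instruction may only be decoded by a simple decoder.

FUSIBLE = {"and"}  # assumed fusible here; "or" and "movaps" are not

OR_LOOP    = ["or", "or", "or", "movaps", "or", "or", "movaps", "or"]
AND_LOOP_A = ["and", "and", "and", "movaps", "and", "and", "movaps", "and"]
AND_LOOP_B = ["and", "and", "and", "movaps", "and", "and", "and", "movaps"]


def decode_cycles(loop, fusible, iterations=1000, width=4):
    """Return decode cycles per loop iteration under the toy model."""
    # True marks a macro-fusible instruction in the unrolled stream.
    stream = [mnemonic in fusible for mnemonic in loop] * iterations
    pos = 0
    cycles = 0
    while pos < len(stream):
        cycles += 1
        for slot in range(width):
            if pos >= len(stream):
                break
            if slot == 0 and stream[pos]:
                # Complex decoder idles; the instruction waits for slot 1.
                continue
            pos += 1  # this slot decodes one instruction
    return cycles / iterations


if __name__ == "__main__":
    for name, loop in [("OR", OR_LOOP), ("AND, first order", AND_LOOP_A),
                       ("AND, second order", AND_LOOP_B)]:
        print(name, decode_cycles(loop, FUSIBLE))
```

With these assumptions the model gives 2.0 clocks / 8 instructions for the OR loop, 2.5 for the first AND ordering and about 2.0 for the second one (one extra startup cycle over 1000 iterations). That is close to the measured 2 / 2.45 / 2, so the measurements at least do not contradict the idea, though a model this crude proves nothing about the real decoders.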