This looks like an alignment issue. The code is fetched in 16-byte blocks. Instructions that cross a 16-byte boundary (or a 32-byte boundary?) are decoded less efficiently. The µop cache is coupled to the instruction cache, with a maximum of three 6-µop entries per 32-byte block of code. I don't really understand how this translates into inefficiency when instructions of certain lengths execute out of the µop cache.
I have done some experiments to test your claim that fuseable instructions decode less efficiently:
xchg r8,r9 ; 3 µops. Decodes alone
or eax,eax ; 1 µop, D0
or ebx,ebx ; 1 µop, D1
or ecx,ecx ; 1 µop, D2
or edx,edx ; 1 µop, D3
This decodes in 2 clocks. If the last OR is changed to an AND, it decodes in 3 clocks. The difference is that AND can fuse with a following conditional branch, while OR cannot. The decoders will not put a fuseable arithmetic/logic instruction in decoder D3, because then they can't check in the same clock cycle whether the next instruction is a branch. There is no effect when the code executes out of the µop cache.
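To make the slot reasoning explicit, here is a toy model of the decoder allocation in Python. It is only a sketch built on the rules stated in this post: four decoders D0-D3 per clock, a multi-µop instruction decodes alone, and a fuseable instruction is never placed in the last decoder. The names Instr and decode_clocks are made up for illustration, and the model ignores fetch alignment, instruction length and the µop cache entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    uops: int = 1
    fusible: bool = False  # assumed: could fuse with a following branch


def decode_clocks(instrs, n_decoders=4):
    """Count decode clocks under the simplified rules described above."""
    clocks = 0
    i = 0
    while i < len(instrs):
        clocks += 1
        slot = 0
        while i < len(instrs) and slot < n_decoders:
            ins = instrs[i]
            if ins.uops > 1:
                # Multi-µop instruction: only D0 takes it, and it decodes alone.
                if slot == 0:
                    i += 1
                break
            if ins.fusible and slot > 0 and slot == n_decoders - 1:
                # Fuseable instruction would land in the last decoder: defer it.
                break
            i += 1
            slot += 1
    return clocks


xchg = Instr("xchg r8,r9", uops=3)
ors = [Instr(f"or {r},{r}") for r in ("eax", "ebx", "ecx", "edx")]
with_or = [xchg] + ors
with_and = [xchg] + ors[:3] + [Instr("and edx,edx", fusible=True)]

print(decode_clocks(with_or))   # 2
print(decode_clocks(with_and))  # 3
```

The two counts agree with the measurements above, but that only shows the model is consistent with this one experiment, not that the hardware works exactly this way.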