I have always assumed that the flag-merging behavior described in the uarch manual, where an extra uop is inserted to merge flags, also applies to Skylake (the manual says that it does). Recently, however, during a discussion about exactly which situations cause the merging uop to be inserted [1], I tried to measure the presence of the merging uop and found none at all, even in cases where one should definitely occur regardless of the precise behavior being debated in [1]. For example, I ran this test:
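    xor eax, eax        ; rax starts at 0, so inc does not yield ZF=1 in practice
.top:
%rep 128
    add rcx, 5          ; writes all flags, including CF
    inc rax             ; writes ZF (among others) but leaves CF unchanged
    jna .never          ; reads CF (from add) and ZF (from inc)
%endrep
    dec rdi             ; loop counter
    jnz .top
    ret
.never:
    ud2                 ; trap if the branch is ever taken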
The jna instruction reads ZF and CF, which are set by inc and add respectively, so this should certainly require a "merge" uop to be inserted. However, none of the performance counters I checked, including the uops-executed counters for all the ports, showed any evidence of the merging uop. For the 3-instruction sequence of add, inc, jna I always saw a total of 3 uops (note that there is no macro-fusion). This was true for the test above and for ones that should not need any merging uop (e.g., reversing the position of the add and inc instructions). All tests ran in 1.25 cycles for those 3 instructions, which I guess is the result of occasional port conflicts.

In your tests, did you observe the flag-merging uop via performance counters, or indirectly in some other way? If it was via the performance counters, should the above test show evidence of the merging uop? Is it possible that it has been eliminated in Skylake?
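For concreteness, the reversed-order control mentioned above could look like the following sketch. The exact code is not part of the measurements reported here, so treat the layout as an assumption: it reuses the same loop skeleton and only swaps add and inc, so that jna takes both CF and ZF from the last flag-setting instruction.

```nasm
; Sketch of the reversed-order control (assumed layout, same skeleton as the test above).
    xor eax, eax
.top:
%rep 128
    inc rax             ; writes ZF (among others) but leaves CF unchanged
    add rcx, 5          ; writes all flags, so it is the sole source for jna
    jna .never          ; reads CF and ZF, both from add
%endrep
    dec rdi             ; loop counter
    jnz .top
    ret
.never:
    ud2                 ; trap if the branch is ever taken
```

Neither condition discussed in [1] predicts a merging uop for this ordering, which is what makes it usable as a baseline.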
[1] In particular, whether the condition is (1) that a flag-reading instruction reads any flag set by an instruction which is not the last flag-setting instruction, or instead (2) that a flag-reading instruction reads a set of flags coming from two different instructions.
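To make the distinction in [1] concrete, here is a hypothetical sequence on which the two conditions disagree. It is not one of the tests reported in this post, only an illustration: the branch reads CF alone, which comes from add, while inc is the last flag-setting instruction.

```nasm
; Hypothetical probe, not taken from the measurements in this post.
; Same loop skeleton as the main test; only the branch condition differs.
    xor eax, eax
.top:
%rep 128
    add rcx, 5          ; last writer of CF
    inc rax             ; last flag-setting instruction, but does not write CF
    jc .never           ; reads only CF, i.e. flags from a single instruction (add)
%endrep
    dec rdi             ; loop counter
    jnz .top
    ret
.never:
    ud2                 ; not expected to be reached
```

Under condition (1) this would need a merging uop, because the flag being read was set by an instruction that is not the last flag setter. Under condition (2) it would not, because every flag the branch reads comes from one instruction.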