Agner wrote:
Is there a limitation on decoding short instructions? Is this documented anywhere?
I'm not sure this is really a predecoder limitation. For example:
or reg, reg
or reg, reg
or reg, reg
mov reg, [reg]
Ideally this code sequence should run at 1 clock per 4 instructions. I varied the instruction lengths from 2 to 4 bytes using these variants:
or r32, r32 : 2B OR
or r64, r64 : 3B OR
or r64, 1 : 4B OR
mov r32, [reg] : 2B MOV
mov r64, [reg] : 3B MOV
mov r64, [reg+8] : 4B MOV
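For reference, here is one possible set of encodings that gives these lengths, written as a small Python check. The register choice (eax/rax as destination, rsi as base) is only an assumption for illustration; the actual test may have used other registers, and some choices (r8-r15, or rsp/rbp/r12/r13 as the base register) would add a prefix, SIB or displacement byte and change the lengths.

```python
# Hypothetical encodings: registers eax/rax and rsi are assumed here,
# they are not stated in the post.
ENCODINGS = {
    "or eax, eax":      bytes([0x09, 0xC0]),              # 2B OR
    "or rax, rax":      bytes([0x48, 0x09, 0xC0]),        # 3B OR (REX.W)
    "or rax, 1":        bytes([0x48, 0x83, 0xC8, 0x01]),  # 4B OR (REX.W + imm8)
    "mov eax, [rsi]":   bytes([0x8B, 0x06]),              # 2B MOV
    "mov rax, [rsi]":   bytes([0x48, 0x8B, 0x06]),        # 3B MOV (REX.W)
    "mov rax, [rsi+8]": bytes([0x48, 0x8B, 0x46, 0x08]),  # 4B MOV (REX.W + disp8)
}


def encoded_lengths(encodings=ENCODINGS):
    """Return a dict mapping each instruction string to its length in bytes."""
    return {asm: len(code) for asm, code in encodings.items()}


if __name__ == "__main__":
    for asm, code in ENCODINGS.items():
        hex_bytes = " ".join(f"{b:02x}" for b in code)
        print(f"{asm:18s} {len(code)}B  {hex_bytes}")
```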
The results are:
inst.       clocks / 4 insts.
pattern     $miss   $hit
--------    -----   -----
2+2+2+2     1.0     1.0
3+2+2+2     1.13    1.13
3+3+2+2     1.25    1.19
3+3+3+2     1.31    1.0
3+3+3+3     1.21    1.15
4+3+3+3     1.16    1.0
4+4+3+3     1.0     1.10
4+4+4+3     1.0     1.16
4+4+4+4     1.0     1.0
So it seems there are some limitations regarding the instruction count in a 16B (or larger) code block, for both the legacy decoder and the uop cache.
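To put numbers on "instruction count in a 16B block", here is a rough Python sketch that computes the size of each 4-instruction group and the average number of instructions per 16 bytes. It assumes the pattern is simply repeated back to back many times, so it gives an average density only; it does not model the predecoder, the decoders or the uop cache, and the function names are my own.

```python
from fractions import Fraction

# Instruction-length patterns from the measurement table above.
PATTERNS = [
    "2+2+2+2", "3+2+2+2", "3+3+2+2", "3+3+3+2", "3+3+3+3",
    "4+3+3+3", "4+4+3+3", "4+4+4+3", "4+4+4+4",
]


def group_bytes(pattern):
    """Total code bytes of one group, e.g. '3+3+3+2' -> 11."""
    return sum(int(x) for x in pattern.split("+"))


def insts_per_block(pattern, block=16):
    """Average instructions per `block` bytes when the group repeats back to back."""
    lengths = [int(x) for x in pattern.split("+")]
    return Fraction(len(lengths) * block, sum(lengths))


if __name__ == "__main__":
    print("pattern   bytes/group  insts/16B  group size divides 16?")
    for p in PATTERNS:
        size = group_bytes(p)
        density = float(insts_per_block(p))
        print(f"{p}   {size:11d}  {density:9.2f}  {16 % size == 0}")
```

Only the 8-byte and 16-byte groups divide 16 evenly, so only for those two patterns does every 16B block contain the same instruction layout. For the others, instructions straddle the 16B boundaries at varying offsets, which may or may not be related to the slowdowns measured.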