Agner wrote:
Hubert Lamontagne wrote:
The way MIPS stays simple is that they disallow any sort of multi-step situational thing with lots of changing registers and multiple memory accesses... No push/pop, no call/return, no automatic loading of selector offsets and segment size like on the 286, and especially no task-switch instruction. This is also why RISC-V is designed that way: the idea is that by having no multi-result instructions, but having higher throughput because it's easier to make a CPU with out-of-order execution and lots of execution units, you still come out ahead in overall speed.
Why is it so much more difficult to have two-result instructions?

I admit, it's not that much more difficult to have two-result instructions. But it has a cost: a 4-issue CPU ends up potentially writing to 8 registers per cycle. This means that you need a register file with 8 write ports, which takes up more space and has a higher latency. Your register renamer also needs 8 write ports instead of 4, and the potential number of conflict scenarios that have to be broken down at the issue stage goes up. If you have a Pentium Pro-style pipeline where results are committed to a permanent register file in order, this also goes up from 4 write ports to 8. If you have an R10000-style pipeline where the permanent register file is only for renaming, then you have a different but similar problem: you have to queue 8 now-reusable registers to the register renamer per cycle instead of 4.

You could also limit the number of instructions you issue if they take up too many write ports - for instance, if you have 6 write ports, you can check on each cycle whether you're going to use up too many write ports and only issue 3 instructions to prevent the 7-or-8-writes-in-a-single-cycle scenario described above. But then you need more arbitration circuitry, and your potential benefit from multi-result instructions goes down.

I guess it all comes down to what your limiting factor is:
- If your limiting factor is issue width (how many different instructions per cycle you can issue), then multi-result
instructions are good because you're doing more work out of your few available instructions. x86 tends to
fall in this case due to the whole instruction length business, and good x86 designs tend to do lots of work
from few instructions (AMD's 3-issue Athlon is a perfect example of this - and a good example of "fast CISC").
- If your limiting factor is register-file and rename ports, then multi-result instructions are bad because they
won't be faster than multiple instruction sequences and they make the pipeline more complex overall (since
the multiple results have to be committed together and so forth).
- If your limiting factor is L1 cache read and write ports, then it's all a wash.

Agner wrote:
We need two-result instructions for: add-with-carry

I don't think ADC is used often enough to warrant being included in a general-purpose instruction set. It makes perfect sense for 8-bit and 16-bit processors, but for 32-bit processors you rarely - if ever - do 64-bit calculations. This counts double for 64-bit processors: ADC only ever appears if you want to do 128-bit calculations (or larger), which is even less common, and it only saves 2 instructions and 1 cycle of latency over the equivalent MIPS sequence (using compare-and-set-to-1-if-larger-or-equal to generate the carry and adding it in separately).

Agner wrote:
overflow detection

Careful use of comparison instructions handles this case adequately, as far as I can tell. For instance, when doing unsigned addition, you only have to compare the result with a source operand: if the result is smaller, you have a 100% guaranteed wrap. Or, since most integer operations happen on 32-bit ints, you can generate an oversized 64-bit result and check for overflow separately. Furthermore, these overflow checks happen outside of the critical
path, so unless the instruction stream is saturating the CPU's ALUs, these checks are basically free.

The other option is to generate trap interrupts on overflows, but this generally can't be used in high-level-language interpreters (too coarse-grained, hard to recover from an interrupt), or in C++ (no support, and programmers expect int32_t calculations to wrap), and it's not particularly useful in ASM either (which is more speed-oriented than security-oriented).

Agner wrote:
pop register

Ok, this one is actually fairly common in general-purpose code. The MIPS equivalent is pretty okay too though: load
register + increment sp. This is especially OK if you pop multiple registers at the same time, in which case all the
sp increments can be combined together into one large increment at the end and the CPU doesn't have to keep
track of the intermediate values of sp. It generates a sequence like this:

    ld r4 [sp + #0]
    ld r5 [sp + #4]
    ld r6 [sp + #8]
    ld r7 [sp + #12]
    add sp #16

On out-of-order CPUs, this might run faster than 4 pop instructions because it typically generates 5 micro-ops
instead of 8. Other times, the execution speed of that sequence is limited by L1 data cache ports so there's no
speed difference.

If you only have to pop a single value, then the splitting hurts more, but then your function is likely to be inlinable or a leaf function.

Agner wrote:
return

Return falls into more or less the same case as pop register: it's fairly common, enough to warrant special
consideration, but the MIPS sequence (ld r31 [sp + #x], add sp #4, jmp r31) also handles it well (since
loading/storing the link register can be combined with the rest of the loads/stores to stack so it often
doesn't generate any extra sp updates). Doing it with a single complex instruction rather than multiple simple
ones often doesn't generate much win since the number of overall memory operations and state changes
doesn't change ('return' afaik is always a 2- or 3-micro-op instruction on x86).

Agner wrote:
read-and-increment-pointer

That case is similar to pop register: on one hand, you get two results for the price of one if your front-end can
only generate a limited number of instructions (which is why ARM has this instruction), but on the other hand,
separating reading and pointer-increment lets you combine a whole bunch of updates to the same pointer
together, which is often good since it reduces the number of intermediary results for the pointer value (and
removes the false dependency between multiple consecutive reads to the same incremented pointer).

If you look at later ARM CPUs, read-and-increment instructions often have speed penalties since the underlying CPU can't really handle the extra generated values (too few register write ports, etc.), so there's often very little gain over using separate load and increment instructions.

Agner wrote:
ALU-and-conditional-jump

That one is more interesting because it's not really a multiple-result instruction... The ALU result and jump
go to different parts of the retirement unit (regfile writeback and branching respectively), which is why
combined compare+branch appear in many RISC instruction sets (including MIPS).
Agner wrote:
If we want an efficient implementation of add-with-carry with a single register, we may have an extra bit in the register for this purpose. Or we may use every second element in a vector for the carry bit, at the cost of getting half the work done. The same for overflow detection. Function calling also becomes complicated if we do not allow multiple results. The call of a non-leaf function will be: (1) copy instruction pointer to link register, (2) jump to target, (3) subtract from SP, (4) save link register on stack. And the return will be: (5) recall link register from stack, (6) add to SP, (7) jump to link register. That is seven instructions instead of two. Does the increase in speed really make up for that?
Steps (1) and (2) are typically a single instruction (since one retires to the regfile and one retires to the branch unit), so that's not a problem. Function calls are likely to generate a whole bunch of memory loads/stores - typically to object member variables in C++ - so it's very likely there will be a free pipeline ALU slot for the SP updates (3) and (6), and as stated above you only need a single SP update no matter how many registers you're loading/storing. If a cache miss or branch misprediction happens (which is probably most likely near function starts and ends), then you'll likely have dozens of free ALU cycles that you can use to deal with SP. Another common case is that function calls/returns often have series of instructions limited by data-cache latency (e.g. loading a pointer, then using it to read from RAM), in which case you also have many 'free' ALU cycles.