Agner`s CPU blog

Proposal for an ideal extensible instruction set

Author: Hubert Lamontagne

Date: 2016-02-18 19:32

Hmm, I think this is definitely a CISC, and not a RISC-CISC compromise. It does have the one good-but-kinda-expensive feature of CISC: Load-ALU operations.

Some criticism of the proposal:

Instruction format:
For the instruction size, I agree with Jonathan Morton: the second most useful instruction size after 32bits is probably 16bits, because most code uses mostly instructions that would fit in 16bits, resulting in a size gain of theoretically up to 50%. This is why so much 32bit ARM code is compiled in THUMB mode: the same code typically runs 0%-15% faster because the smaller instruction stream compensates for the extra operations needed to cope with the small opcodes (nb: this has probably changed on newer cores!). Instruction sizes larger than 32bits are not that useful because immediates larger than 16bits are rare, and if you have enough registers, you can load oversized immediates beforehand, and remaining operations with large immediates can be decomposed into 2 or 3 operations (such as add r0, 0x3423; add r0, 0x6543 * 65536). Also, large immediates tend to be multiples of 2/4/8/16/32/etc.., which is why ARM's scheme of taking smaller immediates and bitshifting them works. The other alternative to 32/16bit mixed instruction size for reducing code cache size is cramming 2 or more operations per 32bit opcode (which might actually be a good idea!).

I'm also not sold on load-multiple operations. The reason for this is that operations that write to multiple sources are generally bad. From the point of view of the register renamer and out-of-order execution engine, that's 2..N renames to keep track of, and 2..N writebacks to the register file. This means that instruction issue has to stall because subsequent instructions have to wait for all these registers to get renamed. You lose the simplicity of having each instruction be single-result only. That being said, ARM64's compromise of allowing a 2 target load-pair but not more sounds acceptable to me (since ARM64 has to deal with other multi-result instructions anyways).

Registers:
As far as I can tell, separate register files are GOOD. The reason for this is that as you add more read and write ports to a register file, its size grows quadratically (or worse). This makes cpu components larger, which increases propagation time, and increases fanout, and multiplies the complexity of the register renamer. If you give floating point operands their own register file, then aside from load/store, compare and conversion operations, the FPU never has to interact with the rest of the core. So for the same amount of IPC, say, 2 integer 2 float per cycle, separating float operations means you go from a monstruous 8-read 4-write register file and renaming mechanism where both integer ALUs and FP ALUs have to be wired everywhere, to a 2-issue integer unit and a 2-issue FPU. The FPU can have its own register renaming unit, its own scheduler, its own register file, its own writeback unit, its own calculation latencies, and FPU ALUs can be directly wired to the registers, and the whole FPU can live on a different section of the chip. The front end can simply recognize which ops are FPU and queue them there. The same applies to SIMD.

The reason why the integer register file of the CPU isn't also split is that integer operations have a lot of interactions with each other and with loads/stores and jumps, and getting C++ compilers to recognize which partition to put every op/result in quickly turns into an NP-complete problem. The exception to this is Ivan Godard's Mill's belt, which in a 4 ALU design forces each ALU to only write to a different 1/4th of the registers. It might be possible to make a good case for the 68000's idea of separating pointers into a different register file - after all, the C++ compiler knows which operands are pointers. Yes, this increases the number of opcodes (bad), but it decreases the amount of operations competing for the same register ports and ALUs for some given workload (good).

For the whole operand size thing, aside from SIMD and loads/stores there's no reason to have 8 or 16 bit operations. Inversely, 64bit operations where one of the operands is 32bits and gets sign-extended or zero-extended to 64bits are justifiable (ARM64 has them), including in address computations (there's tons of C++ code that does something like array[int index]). Some 32bit operations can be done in 64bit while ignoring the top bits (add, sub, mul, and, or, xor, shl) but not all (shr, asr, comparisons).

Flags:
Add-with-carry and subtract-with-borrow I think are unnecessary because they can be faked with 3 simple operations: add X with Y, compare the sum with X and output 1 if lower but 0 if larger (SLTU on MIPS), add comparison output to sum. ADC and SBC operations are problematic because they're really 3-input 2-output operations (bad), which means that they'll probably have to be broken down into 2 micro-ops which means you'll probably see little gain over the 3 instruction sequence.

Predicate registers:
I'm definitely not sold on the whole predicate thing. As far as I can tell, compilers really don't like issuing conditionals as anything other than conditional branches. Also, if the conditionals can be accurately predicted, then conditional branches are faster because you only execute one side of the branch, and operations downstream can get their inputs earlier (by register renaming, instead of waiting for the predicated instruction results). For remaining cases, a separate CMOV instruction sounds a lot more justifiable to me than spending 3 bits of every single opcode. Also, remember that predicated operations (3-input) are fundamentally different operations from non-predicated operations (2-input), since the old value can be propagated instead of simply being tossed so it needs to be present as an ALU input.

Rounding mode:
I agree that float-to-integer conversions must support at least truncation for the C++ (int) cast, plus floor() and ceil() and round(). Actually it would probably be useful to have an opcode that does floor() or ceil() without the integer conversion as well (for linear interpolation).

Exception control:
My point of view on exceptions is that they're generally bad, since they can basically turn any ALU operation into a potential conditional jump. This forces you to keep the CPU state at the moment of that operation until you're sure that the ALU operation went the right way. Also, they are useless for running C++ code. (Ok, to be fair, you already need to deal with potential page faults on every single load/store so it's not really that much more work, but this is definitely not the kind of thing I'd want to encourage)

Zero masked data:
The reason I can see for not putting this one in is that non-zero-masked ALU operations are actually very different operations and rather complex, since they prevent register renaming and forces you to implement 3-input result merging versions of basically every ALU operation (similar to the predicated operation above). These result-merging operations will probably see little use aside from manipulation tricks in hand-written assembly.

FPGA:
This is probably an okay idea for platforms like game consoles that essentially run a single program. This reminds me of the DSP on stuff like the N64, which you could rewrite the bytecode for (which only Factor5 ever did if I'm not mistaken). But otherwise, I think this is essentially impossible to task switch: you'd need to build the whole FPGA state to be loadable/storable which would probably make the performance pretty bad.

Ok, my turn with suggestions!

2 x ALU Instructions (sequential!):
Your 3rd operand can either be an immediate, register value or memory loaded value. I suggest adding a 4th option: letting this 3rd operand be the result of a simple 2-operand math operation (ie something like add, sub, and, or, xor, bitshifts, maybe mul...), potentially with a small immediate as 2nd operand. ARM already has something similar to this (except the only operation you can do is a bitshift), and the multiply-accumulate is also similar, and load-ALU operations can also be seen as a version of this. This is a very common sequence: a LOT of code is made out of 2-7 successive ALU operations on the same operand. The cost of this is that this is a 2-cycle latency operation reading 3 register ports. The benefits is that you're getting 2 instructions for the price of 1: you can squeeze it in a 32-bit opcode (=increased code density without resorting to 16bit opcodes), it's only 1 instruction for the front-end, it saves 1 register read over the equivalent 2 instruction RISC sequence (3 reads instead of 4). But most importantly, this saves 1 register WRITE, which lets you reduce the number of register renames for a given block of operations and reduce the number of write ports to your register file.

Software pipelining:
For software pipelining, I actually have an interesting design to propose, which is similar to Jonathan Morton's proposal but doesn't use a stack. I like to call "SIME" (single instruction multiple execution). For a 8*32bit SIME unit, you'd build it as 8 simple MIPS-like cores, each 1 instruction-per-cycle in-order. But only the first core has a front-end: once the first core executes an instruction, it queues the same instruction to the second core (which will in turn queue it to the 3rd core etc). For values involving feedback (for instance an accumulator), you also have a data queue going from the 1st core to the 2nd, and from 2nd core to 3rd, and so forth, with a queue from 8th core to 1st to provide looping, and ALU operands can come from either registers or from the queue from the previous core, and likewise results can be queued to the next core in addition to being written to the register files. All load/store operations are inherently gather/scatter operations since they execute sequentially on each successive core. For conditionals, cores 2-8 check that the conditional being evaluated produces the same result as on core 1, and if it doesn't, some fallback mechanism is activated (this is generally used to do a number of iterations that isn't a multiple of 8). When an interrupt/task switch/page fault happens, the OS needs special opcodes to load/store values from the instruction queues and data queues to save/restore the state. This system could be extended for larger CPUs by either making it superscalar (2-issue in order or even out-of-order), adding SIMD on top of SIME or adding more cores (it's probably not too hard to design it so that the number of cores can be changed without changing the instruction stream, aside from OS state loading/saving). Unfortunately I don't think C++ compilers can automatically produce the kind of loop that would run on this, due to the fact that memory loads/stores are reordered (unless either pointer aliasing detection gets much better, or a Transmeta Crusoe-style load/store aliasing resolution mechanism is provided).

Reply To This Message

Previous Message

Next Message