Agner's CPU blog

Do we need instructions with two outputs?

Author: Hubert Lamontagne

Date: 2016-04-01 02:46

I've thought about the "assigning a flag register for each register" idea a bit, and its problem is not only the saving/restoring by the OS on interrupts, but also that it makes all callee-saved registers in the calling convention doubly expensive - because the leaf function has to save not only the register value, but also its matching flags. You get a conflict between the principle that "flags aren't saved on function calls" (because they get wiped too easily and you pretty much never want to preserve them anyways) and the principle that "flags should have the same calling convention as their matching registers" (since they presumably come from the same register rename and update together). It also does all sorts of weird things, like changing 'nops' into different operations (mov r0, r0; add r0, #0; and r0, #-1; and the st r0, [sp+0] + ld r0, [sp+0] sequence all become different due to flag effects). Mov preserves flags (so that you can make 'mov' happen 'for free' by doing it at the register rename stage) but mov immediate doesn't. This is bad for late compiler optimization passes (because they have to take flags into account). So I think having flags be a part of value registers creates more problems than it solves.
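To make the 'nop' point concrete, here is a rough Python sketch of a register that carries its own flags. Everything in it is an assumption made for illustration, not a description of any real ISA: the 32-bit width, the choice of just a carry and a zero flag, the rule that logical operations clear carry, and the rule that a value reloaded from memory comes back with cleared flags. The names Reg, mov, add_imm, and_imm and spill_reload are made up for this sketch.

```python
from dataclasses import dataclass

MASK = 0xFFFFFFFF  # assumed 32-bit registers


@dataclass(frozen=True)
class Reg:
    """A value register with its own attached flags (hypothetical design)."""
    value: int
    carry: bool = False
    zero: bool = False


def mov(src: Reg) -> Reg:
    # Register-to-register move: done at rename, so value AND flags are copied.
    return src


def add_imm(src: Reg, imm: int) -> Reg:
    # Arithmetic recomputes both flags from the new result.
    total = src.value + (imm & MASK)
    return Reg(total & MASK, carry=total > MASK, zero=(total & MASK) == 0)


def and_imm(src: Reg, imm: int) -> Reg:
    # Assumed rule: logical ops clear carry and recompute zero.
    result = src.value & (imm & MASK)
    return Reg(result, carry=False, zero=result == 0)


def spill_reload(src: Reg) -> Reg:
    # st r0, [sp+0] followed by ld r0, [sp+0]: memory only holds the value,
    # so (assumed rule) the reloaded register comes back with cleared flags.
    return Reg(src.value)


# Start from a value whose flags are "interesting": 0xFFFFFFFF + 1 wraps to 0.
r0 = add_imm(Reg(0xFFFFFFFF), 1)

candidates = {
    "mov r0, r0": mov(r0),
    "add r0, #0": add_imm(r0, 0),
    "and r0, #-1": and_imm(r0, -1),
    "st/ld r0": spill_reload(r0),
}
```

Under these assumed rules, all four sequences leave the value at 0, but the attached flags end up in three different states, so a late optimization pass could not freely swap one sequence for another.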

If bignums are really important and we need fast ADC for stuff like RSA encryption, I suggest we should make ADC/SBC vector-unit only, and have specific flags registers attached to the vector unit (to reduce the number of dependency paths from the SIMD unit to the main unit). Also, I'd separate the flags that are a result of SIMD operations (i.e. carry) from the flags that control SIMD operations (zeroing, denormal-zeroing, etc), so that SIMD operations that update flags can simply wipe the previous value - partial flag register updates are bad, since they require a separate flag rename engine for every part that doesn't update together!
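For reference, the carry chain that ADC exists to serve looks roughly like this. This is a minimal Python sketch of multi-limb addition, assuming 64-bit limbs stored least-significant first; adc and bignum_add are illustrative names, not instructions or APIs from any particular ISA. Note that each adc call fully overwrites the carry, which is the "wipe the previous value" behaviour argued for above.

```python
LIMB_BITS = 64  # assumed limb width
LIMB_MASK = (1 << LIMB_BITS) - 1


def adc(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add two limbs plus carry-in; return (result limb, carry-out)."""
    total = a + b + carry_in
    # The carry-out replaces the old carry entirely: no partial flag update.
    return total & LIMB_MASK, total >> LIMB_BITS


def bignum_add(x: list[int], y: list[int]) -> tuple[list[int], int]:
    """Add two equal-length little-endian limb lists; return (limbs, carry)."""
    if len(x) != len(y):
        raise ValueError("operands must have the same number of limbs")
    out = []
    carry = 0
    for a, b in zip(x, y):
        # Each step depends on the previous carry: a serial dependency chain.
        limb, carry = adc(a, b, carry)
        out.append(limb)
    return out, carry
```

For example, bignum_add([LIMB_MASK, LIMB_MASK], [1, 0]) ripples the carry through both limbs and returns ([0, 0], 1).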

The exception to this is vector length (for variable vector length), which has to be on the integer-scalar unit because it has to be known by load/store instructions.

Ideally, it's best if SIMD instructions cannot cause interrupts and can't affect program flow. For most architectures, SIMD operations execute on a different unit with a different issue queue, so it's best if non-SIMD and SIMD operations can be separated as fast and easily as possible - basically right after determining instruction size, since operations go to different queues and compete for different register ports and so forth. In theory, you could even design a cpu with a separate IP and instruction cache for the SIMD unit, and do all SIMD loads/stores through a queue (the PS3's Cell is sorta like this, in a way).

For instance, on the ARM Cortex A8, NEON/FPU instructions literally CANNOT cause interrupts since the non-SIMD instruction results don't even commit at the same time, so SIMD instructions and every subsequent instruction have to be 100% sure to run (because the results of subsequent non-SIMD instructions have already been committed and cannot be undone). The benefit of this is that the non-SIMD commit unit doesn't even have to know that the SIMD unit exists, except for receiving store values in a queue, and the SIMD unit likewise knows nothing about the non-SIMD unit except that instructions and values loaded from memory and forwarded from GPRs arrive in a queue and that store values go in a queue.

X86 enforces this less strongly (so that committing has to be synchronized between the general-purpose unit and the SIMD unit) but even then, there's a reason why, on the Athlon, COMISS (SSE float compare and update main unit flags register) runs on the Vector Path (microcode!) - and it's not because AMD engineers thought people wouldn't use COMISS.

Basically, I don't think orthogonality between non-SIMD instructions and SIMD instructions is a good idea, since they have different goals: non-SIMD instructions have to have as few weird side effects as possible and retire as fast as possible, so that they can be renamed and reordered and jumbled and rolled back if prediction has failed (or a load/store caused a page fault). SIMD instructions just have to do as much math as possible per cycle - they're all about throughput, so it doesn't matter if they take 4 or 5 cycles to complete, or if they can't be reordered and so forth - which is why VLIW is popular in cpus designed for DSP (they don't have to run C++!).

SIMD-oriented code also tends to be more likely to be simple short loops so I don't think it really has to be particularly compact. Also the memory bandwidth usage for instructions will probably be totally dwarfed by the memory bandwidth usage for data in SIMD code anyways.

I don't think saving/restoring the register file on task switch using 32 consecutive loads/stores (+sp update) is THAT much of a problem because task switches cause other much slower side effects - for instance, you're likely to get a whole bunch of instruction cache misses and data cache misses and TLB evictions and TLB misses and branch prediction misses and the cache prefetcher getting confused - those are many times more costly.

To handle interrupts, you do need a few scratchpad registers that are only accessible to the OS, for saving the previous values of SP + IP + OS/user mode + a couple extra registers. This is to get work space to "bootstrap" the interrupt handler state save/restore. Early MIPS had the problem that it didn't really have those system-reserved special registers, so you unfortunately lost a couple of general purpose registers instead.
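The "bootstrap" problem can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the Cpu class, its field names, the single scratch register, the word-addressed memory dictionary and the save_area address are all hypothetical, and real hardware differs in many details. The point it shows is that the handler needs somewhere to park one GPR before it has a free register to use as a base pointer.

```python
NUM_GPRS = 32  # assumed register file size


class Cpu:
    """Toy CPU state; all field names are made up for this sketch."""

    def __init__(self):
        self.gpr = list(range(100, 100 + NUM_GPRS))  # live user values
        self.sp, self.ip, self.user_mode = 0x7000, 0x400, True
        # OS-only registers, not visible to user code (assumed):
        self.saved_sp = 0
        self.saved_ip = 0
        self.saved_user_mode = False
        self.scratch = 0
        self.memory = {}  # word-addressed memory, modeled as a dict


def take_interrupt(cpu: Cpu, handler_ip: int) -> None:
    # Hardware step: latch the interrupted context into OS-only registers,
    # then jump to the handler in OS mode. No GPR is touched.
    cpu.saved_sp, cpu.saved_ip = cpu.sp, cpu.ip
    cpu.saved_user_mode = cpu.user_mode
    cpu.ip, cpu.user_mode = handler_ip, False


def handler_prologue(cpu: Cpu, save_area: int) -> None:
    # Software step: park r1 in the scratch register to free a base pointer.
    cpu.scratch = cpu.gpr[1]
    cpu.gpr[1] = save_area
    for i in range(NUM_GPRS):
        # r1 itself now holds the pointer, so its old value comes from scratch.
        value = cpu.scratch if i == 1 else cpu.gpr[i]
        cpu.memory[cpu.gpr[1] + i] = value
    cpu.memory[save_area + NUM_GPRS] = cpu.saved_sp
    cpu.memory[save_area + NUM_GPRS + 1] = cpu.saved_ip
```

Without the scratch register, the first line of the prologue would have to clobber a general purpose register, which is roughly the situation described for early MIPS.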

You also probably need the hardware TLB to have different memory mappings for OS and user and switch automatically between those. Another way to deal with this is having a few banked registers (typically SP and the last few registers - just enough to initiate state saving). Even though this makes interrupt handler prologues kinda ugly, it also removes the need for microcoded interrupt handlers (which are often somewhat bypassed by the OS anyways).
