Agner`s CPU blog

Software optimization resources | E-mail subscription to this blog | www.agner.org

Do we need instructions with two outputs?
Author: Hubert Lamontagne Date: 2016-04-01 02:46
I've thought about the "assigning a flag register for each register" thing a bit, and its problem is not only that saving/restoring by the OS on interrupts, but also that it makes all callee-saved registers in the calling convention double-expensive - because the leaf function not only has to save the register value, but also its matching flags. You get a conflict between the principle that "flags aren't saved on function calls" (because they get wiped too easily and you pretty much never want to preserve them anyways) and the principle that "flags should have the same calling convention as their matching registers" (since they presumably come from the same register rename and update together). It also does all sorts of weird things like change 'nops' into different operations (mov r0, r0; add r0, #0; and r0, #-1; and the st r0, [sp+0] + ld r0, [sp+0] sequence all become different due to flag effects). Mov preserves flags (so that you can make 'mov' happen 'for free' by doing it at the register rename stage) but mov immediate doesn't. This is bad for late compiler optimization passes (because it has to take flags into account). So I think having flags be a part of value registers creates more problems than it solves.

If Bignums are really important and we need fast ADC for stuff like RSA encryption, I suggest we should make ADC/SBC vector-unit only, and have specific flags registers attached to the vector unit (to reduce the number of dependency paths from the SIMD unit to the main unit). Also, I'd separate the flags that are a result of the SIMD operations (ie carry) from the flags that control SIMD operations (zeroing, denormal-zeroing, etc), so that SIMD operations that update flags can simply wipe the previous value - Partial flag register updates are bad since it requires separate flag rename engine for every part that doesn't update together!

The exception to this is vector length (for variable vector length), which has to be on the integer-scalar unit because it has to be known by load/store instructions.

-

Ideally, it's best if SIMD instructions cannot cause interrupts and can't affect program flow. For most architectures, SIMD operations execute on a different unit with a different issue queue, so it's best if non-SIMD and SIMD operations can be separated as fast and easily as possible - basically right after determining instruction size, since operations go to different queues and compete for different register ports and so forth. In theory, you could even design a cpu with a separate IP and instruction cache for the SIMD unit, and do all SIMD loads/stores through a queue (the PS3's Cell is sorta like this, in a way).

For instance, on the ARM Cortex A8, NEON/FPU instructions literally CANNOT cause interrupts since the non-SIMD instruction results don't even commit at the same time, so SIMD instructions and every subsequent instruction have to be 100% sure to run (because the result of subsequent non-SIMD has already been committed so it cannot be undone). The benefit of this is that the non-SIMD commit unit doesn't even have to know that the SIMD unit even exists except for receiving store values in a queue, and the SIMD unit likewise knows nothing about the non-SIMD unit except that instructions and values loaded from memory and forwarded from GPRs arrive in a queue and that store values go in a queue.

X86 enforces this less strongly (so that committing has to be synchronized between the general-purpose unit and the SIMD unit) but even then, there's a reason why, on the Athlon, COMISS (SSE float compare and update main unit flags register) runs on the Vector Path (microcode!) - and it's not because AMD engineers thought people wouldn't use COMISS.

Basically, I don't think orthogonality between non-SIMD instructions and SIMD instructions is a good idea, since they have different goals: non-SIMD instructions have to have as few weird side effects as possible and retire as fast as possible, so that they can be renamed and reordered and jumbled and rolled-back if prediction has failed (or a load/store caused a page fault). SIMD instructions just have to do as much math as possible per cycle - they're all about throughput, so it doesn't matter if they take 4 or 5 cycles to complete, if they can't be reordered and so forth - which is why VLIW is popular in cpus designed for DSP (they don't have to run C++!).

SIMD-oriented code also tends to be more likely to be simple short loops so I don't think it really has to be particularly compact. Also the memory bandwidth usage for instructions will probably be totally dwarfed by the memory bandwidth usage for data in SIMD code anyways.

-

I don't think saving/restoring the register file on task switch using 32 consecutive loads/stores (+sp update) is THAT much of a problem because task switches cause other much slower side effects - for instance, you're likely to get a whole bunch of instruction cache misses and data cache misses and TLB evictions and TLB misses and branch prediction misses and the cache prefetcher getting confused - those are many times more costly.

To handle interrupts, you do need a few scratchpad registers that are only accessible to the OS, for saving the previous values of SP + IP + OS/user mode + a couple extra registers. This is to get work space to "bootstrap" the interrupt handler state save/restore. Early MIPS had the problem that it didn't really have those system-reserved special registers, so you unfortunately lost a couple of general purpose registers instead.

You also probably need the hardware TLB to have different memory mappings for OS and user and switch automatically between those. Another way to deal with this is having a few banked registers (typically SP and the last few registers - just enough to initiate state saving). Even though this makes interrupt handler prologues kinda ugly, it also removes the need for microcoded interrupt handlers (which are often somewhat bypassed by the OS anyways).

 
thread Proposal for an ideal extensible instruction set new - Agner - 2015-12-27
replythread Itanium new - Ethan - 2015-12-28
last reply Itanium new - Agner - 2015-12-28
replythread Proposal for an ideal extensible instruction set new - hagbardCeline - 2015-12-28
last replythread Proposal for an ideal extensible instruction set new - Agner - 2015-12-28
reply Proposal for an ideal extensible instruction set new - Adrian Bocaniciu - 2016-01-04
reply Proposal for an ideal extensible instruction set new - Adrian Bocaniciu - 2016-01-04
reply Proposal for an ideal extensible instruction set new - Adrian Bocaniciu - 2016-01-04
replythread Proposal for an ideal extensible instruction set new - Adrian Bocaniciu - 2016-01-04
reply Proposal for an ideal extensible instruction set new - Adrian Bocaniciu - 2016-01-05
replythread Proposal for an ideal extensible instruction set new - John D. McCalpin - 2016-01-05
last reply Proposal for an ideal extensible instruction set new - Adrian Bocaniciu - 2016-01-06
last reply Proposal for an ideal extensible instruction set new - Ook - 2016-01-05
last reply Proposal for an ideal extensible instruction set new - acppcoder - 2016-03-27
reply Proposal for an ideal extensible instruction set new - Jake Stine - 2016-01-11
replythread Proposal for an ideal extensible instruction set new - Agner - 2016-01-12
last replythread Proposal for an ideal extensible instruction set new - Jonathan Morton - 2016-02-02
last replythread Proposal for an ideal extensible instruction set new - Agner - 2016-02-03
last replythread Proposal for an ideal extensible instruction set new - Jonathan Morton - 2016-02-12
last replythread Proposal for an ideal extensible instruction set new - Hubert Lamontagne - 2016-02-18
last replythread Proposal for an ideal extensible instruction set new - Agner - 2016-02-21
last replythread Proposal for an ideal extensible instruction set new - Hubert Lamontagne - 2016-02-22
last replythread Proposal for an ideal extensible instruction set new - Agner - 2016-02-23
replythread Proposal for an ideal extensible instruction set new - Hubert Lamontagne - 2016-02-23
last replythread Proposal for an ideal extensible instruction set new - Agner - 2016-02-24
last replythread Proposal for an ideal extensible instruction set new - asdf - 2016-02-24
last reply Proposal for an ideal extensible instruction set new - Agner - 2016-02-24
last reply Proposal for an ideal extensible instruction set new - Agner - 2016-02-25
replythread limit instruction length to power of 2 new - A-11 - 2016-02-24
last replythread limit instruction length to power of 2 new - Agner - 2016-02-24
replythread Any techniques for more than 2 loads per cycle? new - Hubert Lamontagne - 2016-02-24
last reply Any techniques for more than 2 loads per cycle? new - Agner - 2016-02-25
last replythread limit instruction length to power of 2 new - A-11 - 2016-02-25
last reply limit instruction length to power of 2 new - Hubert Lamontagne - 2016-02-25
replythread More ideas new - Agner - 2016-03-04
replythread More ideas new - Hubert Lamontagne - 2016-03-07
last reply More ideas new - Agner - 2016-03-08
last reply More ideas new - Agner - 2016-03-09
replythread Proposal for an ideal extensible instruction set new - Joe Duarte - 2016-03-07
reply Proposal for an ideal extensible instruction set new - Agner - 2016-03-08
last replythread Proposal for an ideal extensible instruction set new - Hubert Lamontagne - 2016-03-08
last replythread Proposal for an ideal extensible instruction set new - Joe Duarte - 2016-03-09
last replythread Proposal for an ideal extensible instruction set new - Agner - 2016-03-10
last replythread Proposal for an ideal extensible instruction set new - Hubert Lamontagne - 2016-03-11
last replythread Proposal for an ideal extensible instruction set new - Agner - 2016-03-11
last replythread Proposal for an ideal extensible instruction set new - anon2718 - 2016-03-13
last reply Proposal for an ideal extensible instruction set new - Agner - 2016-03-14
replythread A design without a TLB new - Agner - 2016-03-11
replythread A design without a TLB new - Hubert Lamontagne - 2016-03-11
reply A design without a TLB new - Agner - 2016-03-11
last reply A design without a TLB new - Agner - 2016-03-12
reply A design without a TLB new - Bigos - 2016-03-13
last reply A design without a TLB new - Agner - 2016-03-28
replythread Proposal now published new - Agner - 2016-03-22
last replythread Proposal now published new - Hubert Lamontagne - 2016-03-23
last replythread Proposal now published new - Agner - 2016-03-24
last replythread Proposal now published new - Hubert Lamontagne - 2016-03-24
last replythread Proposal now published new - Agner - 2016-03-24
last replythread Proposal now published new - Hubert Lamontagne - 2016-03-24
last replythread Proposal now published new - Agner - 2016-03-25
last replythread Proposal now published new - Hubert Lamontagne - 2016-03-28
last replythread Proposal now published new - Agner - 2016-03-29
last replythread Proposal now published new - Hubert Lamontagne - 2016-03-30
last replythread Proposal now published new - Agner - 2016-03-30
last replythread Do we need instructions with two outputs? new - Agner - 2016-03-31
last replythread Do we need instructions with two outputs? - Hubert Lamontagne - 2016-04-01
reply Do we need instructions with two outputs? new - Agner - 2016-04-01
replythread Do we need instructions with two outputs? new - Joe Duarte - 2016-04-02
last replythread Do we need instructions with two outputs? new - Agner - 2016-04-02
last reply Do we need instructions with two outputs? new - Joe Duarte - 2016-04-02
last replythread Do we need instructions with two outputs? new - Agner - 2016-04-02
last replythread Do we need instructions with two outputs? new - Hubert Lamontagne - 2016-04-02
last replythread Do we need instructions with two outputs? new - Agner - 2016-04-03
reply Do we need instructions with two outputs? new - Joe Duarte - 2016-04-03
last replythread Do we need instructions with two outputs? new - Hubert Lamontagne - 2016-04-03
last replythread Do we need instructions with two outputs? new - Agner - 2016-04-04
reply Do we need instructions with two outputs? new - Hubert Lamontagne - 2016-04-04
last replythread Do we need instructions with two outputs? new - Joe Duarte - 2016-04-06
last replythread Do we need instructions with two outputs? new - Hubert Lamontagne - 2016-04-07
last replythread Do we need instructions with two outputs? new - HarryDev - 2016-04-08
last reply Do we need instructions with two outputs? new - Hubert Lamontagne - 2016-04-09
replythread How about stack machine ISA? new - A-11 - 2016-04-10
last replythread treating stack ISA as CISC architecure new - A-11 - 2016-04-14
last replythread treating stack ISA as CISC architecure new - Agner - 2016-04-14
last replythread treating stack ISA as CISC architecure new - A-11 - 2016-04-17
replythread treating stack ISA as CISC architecure new - Hubert Lamontagne - 2016-04-17
last replythread stack ISA versus long vectors new - Agner - 2016-04-18
last replythread stack ISA versus long vectors new - Hubert Lamontagne - 2016-04-19
last reply stack ISA versus long vectors new - Agner - 2016-04-20
last reply treating stack ISA as CISC architecure new - A-11 - 2016-04-18
replythread Proposal for an ideal extensible instruction set new - zboson - 2016-04-11
last replythread Proposal for an ideal extensible instruction set new - Agner - 2016-04-11
last replythread Proposal for an ideal extensible instruction set new - Hubert Lamontagne - 2016-04-11
last replythread Proposal for an ideal extensible instruction set new - Agner - 2016-04-12
last reply Proposal for an ideal extensible instruction set new - Hubert Lamontagne - 2016-04-12
replythread Version 1.01 new - Agner - 2016-05-10
last replythread Version 1.01 new - Hubert Lamontagne - 2016-05-13
last replythread Version 1.01 new - Agner - 2016-05-14
last replythread Version 1.01 new - Harry - 2016-06-02
replythread Public repository new - Agner - 2016-06-02
reply Public repository new - Harry - 2016-06-02
last reply Public repository new - Harry - 2016-06-02
last reply Public repository new - Agner - 2016-06-09
replythread Rethinking DLLs and shared objects new - Agner - 2016-05-20
replythread Rethinking DLLs and shared objects new - cv - 2016-05-20
last reply Rethinking DLLs and shared objects new - Agner - 2016-05-20
replythread Rethinking DLLs and shared objects new - Peter Cordes - 2016-05-30
last replythread Rethinking DLLs and shared objects new - Agner - 2016-05-30
last replythread Rethinking DLLs and shared objects new - Joe Duarte - 2016-06-17
last replythread Rethinking DLLs and shared objects new - Agner - 2016-06-18
last reply Rethinking DLLs and shared objects new - Bigos - 2016-06-18
last replythread Rethinking DLLs and shared objects new - Freddie Witherden - 2016-06-02
last replythread Rethinking DLLs and shared objects new - Agner - 2016-06-04
last replythread Rethinking DLLs and shared objects new - Freddie Witherden - 2016-06-04
last reply Rethinking DLLs and shared objects new - Agner - 2016-06-06
replythread Is it better to have two stacks? new - Agner - 2016-06-05
reply Is it better to have two stacks? new - Hubert Lamontagne - 2016-06-07
replythread Is it better to have two stacks? new - Eden Segal - 2016-06-13
last replythread Is it better to have two stacks? new - Agner - 2016-06-13
last replythread Is it better to have two stacks? new - Hubert Lamontagne - 2016-06-14
last replythread Is it better to have two stacks? new - Agner - 2016-06-14
last replythread Is it better to have two stacks? new - Hubert Lamontagne - 2016-06-15
last replythread Is it better to have two stacks? new - Agner - 2016-06-15
last replythread Is it better to have two stacks? new - Hubert Lamontagne - 2016-06-16
last replythread Is it better to have two stacks? new - Agner - 2016-06-16
last reply Is it better to have two stacks? new - Hubert Lamontagne - 2016-06-17
last reply Is it better to have two stacks? new - Freddie Witherden - 2016-06-22
last reply Now on Github new - Agner - 2016-06-26