Author: Agner
Date: 2016-04-03 13:49
Hubert Lamontagne wrote:
One catch is that vector registers need special alignment. For instance,
if your vector regs are 512bits and your DCache width is 512bits, you
want your vectors to be 512bit aligned when saved to the stack, so you
need to 512-align your stack pointer.
The code has to be compatible with different processors with different vector sizes, so we don't want the required alignment to depend on the processor. A separate stack for vectors is a possible solution, but very wasteful. Cache space is a limiting resource, so I don't want to save a possibly very long vector when only a small part of it is used. This is why I think it is smart to save the vector length in the vector register itself. A save instruction will save only as much as is needed. In the caller-save situation, the function knows how much of the register is used, so it can use a normal store instruction. In the callee-save situation, the function will rely on the register length information and use the special save instruction to save only what is needed. This situation will be rare anyway, because the 16 caller-save registers will be enough for most purposes.
Separate registers for floating point scalars would be useful if we had to save the full vector in a callee-save situation, but the length information eliminates this need.
I think we can live with 8-byte alignment of vectors. The hardware can handle this with a simple barrel shifter, but it may have to load an extra cache line, of course. Large arrays should be aligned by the cache line size for optimum performance.
I think the format for saving vector registers with the special save instruction should be implementation dependent. It may use a single byte for the length if the length cannot exceed 128 bytes, or it may use more for longer vectors or for the sake of alignment. It may even compress the data if this can be done fast enough. For example, a boolean mask vector using only one bit of each 64-bit element can obviously be compressed a lot. The format will be padded to fit whatever alignment is optimal on the particular processor. The software should use data in the special "save format" for no other purpose than to restore the register on the same processor.
It is a disadvantage that the saved format may be longer than the maximum vector length when it includes the length information. But I think this is outweighed by the advantage that most saved registers will use less space. Many registers will be unused and store only a zero for the length.
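As an illustration, here is a minimal C sketch of such an implementation-dependent save format: a single length byte followed by only the bytes actually in use, so an unused register costs one byte. The function name and layout are hypothetical, not part of any defined ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "save format": one length byte, then only the used
   part of the vector register. A real implementation may pad the
   result for alignment or use a wider length field. */
static size_t save_vector(const uint8_t *reg, uint8_t used_len,
                          uint8_t *buf) {
    buf[0] = used_len;               /* length prefix */
    memcpy(buf + 1, reg, used_len);  /* only the bytes in use */
    return 1 + (size_t)used_len;     /* total bytes written */
}
```

With this layout an unused register (length 0) stores a single zero byte, while a register using 16 of, say, 128 bytes stores only 17 bytes.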
There's an extra cost, it's just in a different non-obvious place:
it forces the compiler to figure out if the carry bits are relevant
for each operation in a chain, and if the compiler can't figure it
out it will output less efficient code. Whereas if carry flags are in their
own register, and only written/read by some operations (on ARM,
when SUB generates flags it is called SUBS and is a different
instruction), then the compiler only ever has to worry about carry
flags for instructions that expressly read/write them (ADDS / ADC /
ADCS / SUBS / SBC / SBCS / CMPS etc), and then it just
becomes one extra register pool in the compiler's register allocator.
As I wrote, I have found an alternative solution for add with carry. We only have to consider whether we need an efficient way of tracking integer overflow.
CPUID could be replaced by a bunch of read-only system registers
that give the CPU model for instance.
Good idea!
Joe Duarte wrote:
Since it's a program's exclusive virtual memory space, a universe of our own, why can't we use arbitrary and much, much smaller addresses?
My priority is the performance of big systems. That's why I have 64-bit address space. All addresses are relative to some pointer (instruction pointer, data section pointer, stack pointer, or an arbitrary pointer) with a signed offset of 8 bits or 32 bits. The instruction size will not be reduced by having a smaller address space, but of course we could save some stack space by having a 32-bit mode. I don't like having two different modes, though. Then we would have problems with stack alignment for doubles, etc. Byte code languages can have their own smaller address space of course.
String parsing also falls in this kind of case. For instance,
checking the length of a string is not so obvious: often the
interpreter knows the number of bytes of a string, but some of
those bytes are UTF-8 characters, and if you account for special
cases (like ending up with Latin-1 aka CP-1252 text), then there's
often really no alternative to just looping through the string byte
per byte.
I think we should use UTF-8 only. It is possible to search for a terminating zero by loading a full vector and comparing all bytes in the vector with zero. My ABI requires a little extra space at the end of user memory to avoid an access violation when reading past the end of a string that happens to be placed at the very end of user memory. But of course it is more efficient to save the length of the string.
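A scalar analogue of that vector zero-scan is the classic word-at-a-time trick: load 8 bytes at once and test all of them for zero with two arithmetic operations. This C sketch assumes, like the ABI described above, that the buffer is padded so reading a few bytes past the terminator cannot fault (the function name is mine, for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Find the length of a zero-terminated string 8 bytes at a time.
   (w - 0x01..01) & ~w & 0x80..80 is nonzero exactly when some byte
   of w is zero. Requires the buffer to be padded past the
   terminator so the 8-byte reads stay in valid memory. */
static size_t padded_strlen(const char *s) {
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    size_t i = 0;
    for (;;) {
        uint64_t w;
        memcpy(&w, s + i, 8);            /* unaligned-safe load */
        if ((w - ones) & ~w & highs) {   /* a zero byte is here */
            while (s[i] != '\0') i++;    /* locate it exactly */
            return i;
        }
        i += 8;
    }
}
```

A vector unit does the same thing with wider loads and a per-byte compare instruction, which is why the padding guarantee at the end of user memory matters.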
I write C++ sound code, and for code with lots
of multiplications that could overflow, I use floating point
code _all the time_. :3
Hardware multipliers are expensive, and divisors are even more expensive. I wonder if we need to support multiplication and division of all operand sizes, including vectors of 8-bit and 16-bit integers, if programmers are using floating point anyway?
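The sound-code pattern Hubert describes can be sketched in C: do the multiplication in floating point, where intermediate overflow is not a concern, and clamp when converting back to 16-bit samples. The helper name `apply_gain` is hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a 16-bit audio sample by a float gain. The intermediate
   product lives in float, so it cannot wrap around the way an
   int16 multiply would; the result is clamped to the int16 range. */
static int16_t apply_gain(int16_t sample, float gain) {
    float v = (float)sample * gain;
    if (v >  32767.0f) v =  32767.0f;
    if (v < -32768.0f) v = -32768.0f;
    return (int16_t)v;
}
```

Code written this way never touches an 8-bit or 16-bit integer multiplier, which is the point of the question above.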