Agner's CPU blog

The discussion in the thread "Proposal for an ideal extensible instruction set" has been very fruitful and has led to many new ideas. I don't know where this will lead us, but the project looks so promising that it is worth pursuing further.

I have put the project on GitHub to facilitate collective development of the instruction set, software toolchain, system standards, and hardware implementations. The address is github.com/ForwardCom. The development of hardware should be done in collaboration with the OpenCores forum.

The name CRISC was taken, so I have changed the name to ForwardCom. It stands for "forward compatible computer system".

The latest version of the manual is at github.com/ForwardCom/manual or www.agner.org/optimize/#instructionset. I have also reserved the domain name www.forwardcom.info.

The ForwardCom project includes both a new instruction set architecture and the corresponding ecosystem of software standards, application binary interface (ABI), memory management, development tools, library formats and system functions. Here are some highlights:

The ForwardCom instruction set is a compromise between the RISC and CISC principles, combining the fast and streamlined decoding and pipeline design of RISC systems with the compactness and more work done per instruction of CISC systems.
The ForwardCom design is scalable to support small embedded systems as well as large supercomputers and vector processors without losing binary compatibility.
Vector registers of variable length are provided for efficient handling of large data sets.
Array loops are implemented in a new flexible way that automatically uses the maximum vector length supported by the microprocessor in all but the last iteration of a loop. The last iteration automatically uses a vector length that fits the remaining number of elements. No extra code is needed to deal with remaining data and special cases. There is no need to compile the code separately for different microprocessors with different vector lengths.
No recompilation or update of software is needed when a new microprocessor with longer vector registers becomes available. The software is guaranteed to be forward compatible and take advantage of the longer vectors of new microprocessor models.
Strong security features are a fundamental part of the hardware and software design.
Memory management is simpler and more efficient than in traditional systems. Various techniques are used for avoiding memory fragmentation. There is no memory paging and no translation lookaside buffer (TLB). Instead, there is a memory map with a limited number of sections with variable size.
There are no dynamic link libraries (DLLs) or shared objects. Instead, there is only one type of function library, which can be used for both static and dynamic linking. Only the part of the library that is actually used is loaded and linked. The library code is kept contiguous with the main program code in almost all cases. It is possible to automatically choose between different versions of a function or library at load time, based on the hardware configuration, operating system, or user interface framework.
A mechanism for calculating the required stack size is provided. This can prevent stack overflow in most cases without making the stack bigger than necessary.
A mechanism for optimal register allocation across program modules and function libraries is provided. This makes it possible to keep most variables in registers without spilling to memory. Vector registers can be saved in an efficient way that stores only the part of the register that is actually used.
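
The array loop behaviour described in the highlights above can be illustrated with a small simulation. This is only a sketch in Python, not ForwardCom assembly: the function name `vector_add_loop` and the parameter `max_vector_length` are hypothetical, and the clamping of the vector length, which the text says happens automatically, is modelled here with a plain `min`.

```python
def vector_add_loop(a, b, max_vector_length):
    """Add two equal-length lists chunk by chunk, imitating a variable-length vector loop.

    max_vector_length stands in for the longest vector the (simulated)
    microprocessor supports. It is an assumption of this sketch.
    """
    assert len(a) == len(b)
    assert max_vector_length > 0
    n = len(a)
    result = []
    chunk_lengths = []  # vector length used in each iteration, for inspection
    remaining = n
    while remaining > 0:
        # Every iteration but the last uses the maximum length;
        # the last one uses whatever is left.
        vl = min(remaining, max_vector_length)
        start = n - remaining
        result.extend(x + y for x, y in zip(a[start:start + vl], b[start:start + vl]))
        chunk_lengths.append(vl)
        remaining -= vl
    return result, chunk_lengths


if __name__ == "__main__":
    a = list(range(10))
    b = [10] * 10
    # The same "program" run on two simulated processors with different vector lengths
    print(vector_add_loop(a, b, 4))
    print(vector_add_loop(a, b, 16))
```

The same loop body runs unchanged for both values of `max_vector_length`; only the number of iterations and the length of the final chunk differ, which is the forward compatibility property the text describes.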

Author: Agner	Date: 2016-08-16 03:11
Actually, we could limit the maximum number of input dependencies to four if we make the rule that instructions with three input operands including a memory operand cannot have a mask. That would not be a serious limitation.

Author:	Date: 2016-07-12 03:35
It is probably not difficult to create a new x86 version that is user mode compatible with most modern programs but lacks things like segmentation (including the GDT/LDT) and real mode. New OS versions would be required, but most modern user mode programs would work with few if any modifications. Anyone want to do a proposal for that?

Author: Agner	Date: 2016-08-01 01:48
I have made an introduction to the ForwardCom project at www.forwardcom.info. I have added a few optional instructions to facilitate matrix multiplication.

Author: Agner	Date: 2017-07-20 23:27
I don't like the idea of using the same bitfield as both instruction and register. You will not have a free choice of what register to use, and the out-of-order scheduler will have problems detecting whether it should wait for the value of that register or not.

Author:	Date: 2016-08-08 06:00
Hello Agner, what about a hardware-accelerated decimal floating point unit like the one in IBM Power CPUs? Business applications in particular need the decimal type, but today it is done in software, which is 100x slower than float/double!

Author:	Date: 2016-09-05 01:53
One recurrent question: How will ForwardCom run Linux's mmap function?

Author:	Date: 2016-09-07 17:18
Agner wrote: "And it could be thread safe because each thread has its own memory map and its own private memory in ForwardCom." Ah, but if your threads don't share memory, then they're not threads, they're separate processes, which is a different thing!

Author:	Date: 2016-09-08 13:04
Commenter wrote: "In fact, I'd imagine that with faster NVRAM style devices, mmap may allow the file to be read directly off the I/O device without ever being read into system RAM." Actually, this is what the Linux DAX infrastructure does. It allows you to map NVRAM directly into the process's address space.

Author: Agner	Date: 2016-09-27 01:21
csdt wrote: "Is it possible to use any vector instruction with an immediate scalar broadcasted?" Yes. The immediate value can have different sizes, integer or float or a small signed integer converted to float.

Author: Agner	Date: 2016-10-29 00:48
Hubert Lamontagne wrote: "How well does ForwardCom handle bilinear interpolation?" It depends on how your data are organized into vectors/arrays. It requires a good deal of permutation. ForwardCom has good permutation instructions, but so do other vector instruction sets.

Author: Agner	Date: 2016-10-30 00:42
I don't think I understand your problem. You have four RGBA points, each in its own vector register. Each of these should be multiplied by a factor, and then they should all be added together. It's just multiplication and addition of 8-bit integers. You may zero-extend everything to 16 bits to avoid loss of precision, then shift right and compress back to 8 bits.
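
The arithmetic described in the comment above can be sketched in scalar Python as follows. This is an illustration under stated assumptions, not ForwardCom code: the function names are hypothetical, the four weights are assumed to be integers that sum to 256 so that the final shift right by 8 renormalises the result, and the fractional positions are assumed to be given in sixteenths.

```python
def bilinear_weights(fx, fy):
    """Weights for the four neighbours from fractional offsets fx, fy in 0..16.

    Order: top-left, top-right, bottom-left, bottom-right.
    The four weights always sum to 16 * 16 = 256.
    """
    assert 0 <= fx <= 16 and 0 <= fy <= 16
    return [(16 - fx) * (16 - fy), fx * (16 - fy), (16 - fx) * fy, fx * fy]


def blend_rgba(pixels, weights):
    """Blend four RGBA pixels (8-bit channels) with integer weights summing to 256."""
    assert len(pixels) == len(weights) == 4
    assert sum(weights) == 256
    out = []
    for channel in range(4):
        acc = 0  # plays the role of a zero-extended 16-bit lane
        for pixel, weight in zip(pixels, weights):
            acc += pixel[channel] * weight
        # Largest possible value is 255 * 256 = 65280, so 16 bits are enough
        assert acc <= 0xFFFF
        out.append(acc >> 8)  # shift right and compress back to 8 bits
    return tuple(out)


if __name__ == "__main__":
    black = (0, 0, 0, 0)
    white = (255, 255, 255, 255)
    # Halfway between a black and a white pixel on the top row
    print(blend_rgba([black, white, black, black], bilinear_weights(8, 0)))
```

In a vector implementation the inner loop over the four channels would be one multiply and one add per input register; the scalar loops here only make the precision argument explicit.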

Author:	Date: 2017-01-05 16:47
RISC-V is starting to reach silicon. Performance is looking pretty good, for a microcontroller. (comparable to the ARM in a Teensy)

Author: Agner	Date: 2017-01-25 13:46
Jonathan Brandmeyer wrote: "One of the gotchas from ARM banked registers is that the FIQ register state was retained across ISR return and reentry." We probably have to clear the registers for security reasons.

Author: Agner	Date: 2017-02-14 13:20
This idea about decoupling the control flow from execution is now included as a proposal in the manual version 1.06 (chapter 8.1). www.agner.org/optimize/forwardcom.pdf

Author:	Date: 2017-03-21 03:41
Agner, is the code for adding "big integers" on page 85 of the ForwardCom manual applicable to any CPU that can do vector integer addition? Could it work on x86 using PADDB? And the idea is applicable to all the other operators (-, *, /, &...), right?
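
For readers who want the general idea behind the question above, here is a sketch of big-integer addition built from independent lane-wise additions plus carry propagation. It is not the listing from the manual, which is not reproduced in this thread. The lane width, the function names, and the loop that repeats until no carries remain are all assumptions made for this illustration.

```python
LANE_BITS = 64  # assumed lane width
LANE_MASK = (1 << LANE_BITS) - 1


def bigint_add_lanes(a, b):
    """Add two big integers stored as equal-length lists of lanes, least significant first.

    Returns (sum_lanes, carry_out).
    """
    assert len(a) == len(b)
    # Step 1: independent lane-wise addition with wraparound, like a vector add
    s = [(x + y) & LANE_MASK for x, y in zip(a, b)]
    # Step 2: a lane overflowed if its wrapped result is smaller than an input
    carry = [int(si < x) for si, x in zip(s, a)]
    carry_out = 0
    # Step 3: move each carry up one lane and add it; repeat while
    # adding a carry makes another lane overflow
    while any(carry):
        carry_out |= carry[-1]  # carry out of the top lane leaves the number
        shifted = [0] + carry[:-1]
        t = [(si + c) & LANE_MASK for si, c in zip(s, shifted)]
        carry = [int(ti < si) for ti, si in zip(t, s)]
        s = t
    return s, carry_out


def from_lanes(lanes):
    """Helper for checking: rebuild a Python int from lanes."""
    return sum(lane << (LANE_BITS * i) for i, lane in enumerate(lanes))


if __name__ == "__main__":
    a = [LANE_MASK, LANE_MASK, 0]
    b = [1, 0, 0]
    print(bigint_add_lanes(a, b))
```

The propagation loop usually finishes after one pass; it only repeats when a carry lands on a lane that is all ones, as in the example under `__main__`.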

Author:	Date: 2017-04-13 20:12
FYI, Intel has an upcoming solution for protecting return addresses: https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf It would be interesting to compare ForwardCom's approach to CET.

Author: Agner	Date: 2017-04-27 00:05
I have started a new thread with a proposal for re-linkable libraries in ForwardCom.

Author:	Date: 2017-08-11 09:13
This looks like the most convenient and easiest to use assembler I have ever seen. I'm really looking forward to playing around with it. Are there reasons why the preprocessing stage and the metaprogramming stage can't be merged into one stage? This would reduce the assembler to only two kinds of code, but I don't know if there would be complications.