At the moment, the biggest red flag I can see in this proposal is the use of "doubleword" to refer to 32 bits. Everywhere outside the Wintel enclave - where even the ubiquitous IEEE-754 "double" format is 64 bits wide - "doubleword" means 64 bits, and 16 bits is referred to as "halfword". It's a Freudian slip which betrays a certain cognitive bias towards the same obsolete 16-bit architecture you're proposing to replace. Not what one wants to see in a design trying to look forward.

I'm also somewhat perplexed by your simultaneous assertion that a fixed-size RISC instruction word causes low code density, and your specification of a *minimum* instruction word size equal to a typical RISC instruction word. The only efficiency you gain is the ability to encode large immediate operands in a single instruction, but modern RISC ISAs (Alpha, AArch64, PowerPC) can already build an arbitrary 32-bit immediate in two instructions (64 bits of code), limiting your code density advantage to immediate operands larger than 32 bits - which are not so common. I'll admit that combined load-arithmetic instructions can improve code density, but that comes at the expense of a more complicated front end in hardware (or, for a traditional in-order CISC, a more complicated pipeline) and has nothing to do with instruction word size. You do also gain a certain future-proofing flexibility by allowing longer instruction formats, but that likewise has nothing to do with code density.

With that said, I have a different proposal for handling vectors, which I think is closer to the original Cray model. In this model there are no architectural "vector registers", only scalars and "pipeline slots". Conceptually, the machine appears to repeat instructions a given number of times on successive data elements, but without an explicit branch instruction, much like the x86 "string" instructions. The cleanest way I can think of to specify this is, oddly, similar to the x87 stack model.
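As an aside, to make the immediate-building point above concrete: this is an illustrative Python sketch of the semantics of AArch64's MOVZ/MOVK pair, which materialises any 32-bit constant in two 32-bit instructions. The function names mirror the mnemonics; the modelling is mine, not an encoding reference.

```python
def movz(imm16: int, shift: int) -> int:
    """MOVZ: zero the register, then place a 16-bit immediate at a 16-bit-aligned position."""
    return (imm16 & 0xFFFF) << shift

def movk(reg: int, imm16: int, shift: int) -> int:
    """MOVK: keep the register contents, overwriting only the targeted 16-bit field."""
    return (reg & ~(0xFFFF << shift)) | ((imm16 & 0xFFFF) << shift)

# Two instructions (64 bits of code) build an arbitrary 32-bit immediate:
x = movz(0xDEAD, 16)    # x = 0xDEAD0000
x = movk(x, 0xBEEF, 0)  # x = 0xDEADBEEF
```

An immediate wider than 32 bits would need further MOVK steps, which is exactly where the variable-length encoding would start to win - but, as noted, such operands are rare.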
Now, x87 was a horrible model for high-performance arithmetic: each instruction could perform only one operation, there was no way to express software pipelining (executing more than one complex expression in parallel), and it was therefore hard to extract ILP at runtime. But it did allow a single expression to be specified compactly and without explicit reference to register names. Substitute Forth as a mental model if you prefer.

Vector instructions would thus be written as if they operated on scalar values, with explicit load, store and pointer-update operations, referring to this stack for their virtual input and output operands. Instead of executing immediately, they would be stored in a buffer and decoded into a pipeline of operations, with the expectation that the entire pipeline runs in parallel at maximum throughput. The pipeline would implicitly be complete when the operand stack became empty; attempting to execute a pipeline with an imbalanced stack would raise a trap.

The complete pipeline would then be executed by loading the initial values into the relevant scalar registers (those named by "input" instructions), followed by a count value written to a special-purpose register. The normal instruction flow would also continue in parallel, so a pipeline-wait instruction would prove useful. Interrupts, including page faults, would not inherently disrupt pipeline building or execution, and their handlers could use the scalar registers independently. For context switching and page-fault handling, it would be necessary to halt, save, restore and resume the pipeline state, whether empty, in the process of being built, complete, or executing.
The great advantage of this system is that the program need not know the number of operations the CPU can perform in parallel (exposing that number was an inherent flaw of Itanium, and is of block-SIMD too), nor the alignment requirements of the memory system (beyond those of the individual data elements), and need not even query them at runtime. An austere implementation could be entirely serial, operating like a standard for-loop over an x87-style stack, within a physical register set barely larger than that required for the architectural scalar registers. A high-performance implementation might, in extreme cases, farm the pipeline out to something like a GPU - an idea which would certainly make AMD prick up its ears.
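The austere, fully serial implementation is easy to picture. Here is an illustrative Python interpreter - the op names and the memory model are my own placeholders - that runs a buffered pipeline as a plain for-loop over an operand stack, one element per iteration, exactly as sketched above:

```python
def run_pipeline(ops, count, mem, regs):
    """Serially execute a buffered vector pipeline `count` times.

    ops:  list of (opcode, operands) in stack (RPN) order
    mem:  dict mapping array names to Python lists
    regs: dict of scalar registers (the pipeline's "input" values)
    """
    for i in range(count):                  # the implicit repeat, with no branch instruction
        stack = []
        for op, args in ops:
            if op == "VLOAD":               # load element i, with implied pointer update
                src = args[0]
                stack.append(mem[src][i] if src in mem else regs[src])
            elif op == "VMUL":
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            elif op == "VADD":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "VSTORE":            # store element i, with implied pointer update
                mem[args[0]][i] = stack.pop()
        assert not stack                    # a balanced pipeline leaves the stack empty

# y[i] = a*x[i] + y[i] over four elements:
ops = [("VLOAD", ("a",)), ("VLOAD", ("x",)), ("VMUL", ()),
       ("VLOAD", ("y",)), ("VADD", ()), ("VSTORE", ("y",))]
mem = {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]}
regs = {"a": 2}
run_pipeline(ops, count=4, mem=mem, regs=regs)
# mem["y"] is now [12, 24, 36, 48]
```

A real high-performance implementation would instead decode the same buffer into chained functional units, or hand it to a wider engine - the program is identical either way, which is the portability argument in a nutshell.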