Joe Duarte: 32 registers kinda balances the need for smaller instructions (and the fact that smaller register files are faster, take less die area, and soak up less power) against the fact that memory accesses are slow and complex. With register renaming, physical regfiles generally have at least 64 registers (MIPS R10k), if not way more (88*3 on Athlon, many more on hyper-threaded CPUs), so anything less than 32 architectural registers only saves instruction bits.

You also see stack-like register windows (SPARC, i960, Am29k) and rotating register files (Itanium), but the general consensus seems to be that this is overdesign and MIPS does just as well with 32 ordinary registers. 16 registers is almost as good in typical code (see: ARM, x64), but the "cost" of going from 16 to 32 is low enough that architectures tend to go with 32. Extremely wide in-order CPUs (i.e. VLIWs) might need more registers to hold all the values generated by software pipelining (Itanium illustrates this), but for "mainstream" designs this isn't considered a good plan (i.e. if you want to make a very wide core, you'll probably have to make it out-of-order to get it any faster than 2 instructions per cycle anyway).

Also note that it's very common to have different numbers of integer, float and SIMD registers. For instance, ARM has 16 integer registers, but its FPU has 32 registers (on the Cortex-A8/A9/A15/etc., shared with SIMD).

2^N variable sizes exist because you want to be able to calculate array memory addresses with a bitshift. If you allowed 24-bit integers, for instance, then your address calculation would become [pointer + (index<<1) + index], which is not so convenient. And DRAM tends to come in multiples of 8 bits or 9 bits (for parity). Some DSP architectures do use 24-bit, 48-bit and other unusual integer sizes.
The idea of idempotent instruction groups is interesting, and somewhat complementary to another instruction-grouping scheme I'm playing with (grouping chains of dependent instructions so that only the last instruction of the group writes to a register).

--- "We've been in an x86, POSIX/Windows, and C rut for a very long time." This is for a good reason. The ~4-instructions-per-cycle out-of-order CPU is pretty hard to beat in terms of practicality and speed, and attempts to beat it face some pretty daunting challenges. Itanium was a valiant effort, but it failed; it just was never really faster than x86.

One big problem is that the L1 data cache will, at best, have 2 read ports and 1 write port, while typical code is often ~30% memory loads/stores. This means it's hard to get a speed gain from a CPU that runs more than about 4 instructions per cycle. The last DEC Alpha design was going to do 8 instructions per cycle, but it just couldn't sustain that for typical programs, and they had to run multiple threads on the core to keep the pipeline full. Part of the reason why Intel is top-of-the-game now is that they're top-of-the-memory-access-game.

In C++, the program basically specifies the exact order of memory loads/stores, and it takes huge efforts to escape this ordering (compiler alias analysis, out-of-order CPUs, weird speculative loads/stores in VLIWs). Multi-threading, SIMD and even GPUs can be viewed as basically mechanisms for making this ordering more flexible. Higher-level languages like Python typically do even more loads/stores/jumps than C++, which makes them even less optimizable (they are essentially serial, and they let you do crazy tricks that force everything to stay serial).
If there's any hope of getting a language that's more efficient than C++, IMHO it's probably a language that enforces "absolutely no pointer aliasing" - so probably no pointers, no references, no side-effects (and probably copy-on-write objects).