Author: Anonymous |
Date: 2016-07-28 07:10 |
Agner, have you studied the MIPS architecture? What do you think of "branch delay slots" ? Personally, I like them, because of the following: There are things the software developer (and the compiler) easily knows, but the hardware developer will hardly know it. The instruction set should give the software developer means to express that knowledge to the hardware. One thing software developers know is which procedure will be called on a function pointer (because of polymorphism, indirect branches have become quite popular), but the hardware couldn't possible know it. When indirect branch instructions like "CALL (register)", "JR (register)" or "JALR (register)" are executed, the hardware is absolutely clueless as to where to move forward, because it needs to know the value of the register before proceding (which it likely won't know. Because of the modern pipelined approach, it needs the write-back value of the register). This will very likely lead to a stall. But, with the instructions "SETCALL (register)" and "CALL (no arguments)", "SETJUMP (register)" and "JUMP (no arguments)", it's possible to execute instructions while the processor is fetching instructions from the target address, like this: SETCALL (register) ; sets the call target, the processors starts to fetch data from it.
instruction one...
instruction two...
instruction n...
CALL (no arguments) ; calls the procedure pointed by the call target. The delay slot gives the processor time to prepare for the call or jump, and relieves the branch predictor of work. The branch predictor might be completely removed from the processor with this approach (more silicon for the hardware developer!). I have never seen any processor with instructions like that (and I have studied the 68000, x86, x64, ARM and MIPS). "if" statements could also benefit from this approach. Delay slots are not novel. The novel part here is its size can be arbitrarily set. I like the statements on the section "10.8 Function libraries and link methods". I wonder if they are there because of the indirect branch problem. Cache miss is also a big topic. Suppose I have five functions (functions in the mathematical sense), A, B, C, D, E, which can be fully executed in parallel. The calls instructions are laid in the following order: CALL A; CALL B; CALL C; CALL D; CALL E; But only the data for the E function is ready (is on the cache). Would the processor stall, while waiting for the data of A, B, C, D arrive? And, even worse, wait for the completion of A, B, C and D? I like that vectors are present on this new instruction set and are an important part of it. Many algorithms use matrix multiplication, which can greatly benefit from instruction-level parallelism, but the current instructions sets are simply disappointing. I'm currently trying to write a fast 2048-bit modular multiplier (for exponentiation with 2048-bit exponents, which should happen under tens of milliseconds, and currently is taking hundreds of milliseconds), and, although Toom-Cook algorithm uses matrix multiplication, it's simply not possible to express it nicely with the instructions given by x86 or x64. |