Back to the subject of dreaming about my
ideal CPU: These are not well thought out, or even
practical, most likely. But they'd be nice. Here
goes:
1. 16GB memory on the same chip as the CPU. I
don't know what the limitations are, but, if all your
PC's memory could be super-fast, on-chip, wow! Memory
wait states slow down the CPU a lot, and, if it was
all on the same chip, you could eliminate all the
complex nightmare caching hardware.
2. If not #1,
then some really good cache hint/directive
instructions.
Generally, this is done with prefetch hint instructions, like the PREFETCH family (PREFETCHT0, PREFETCHNTA and friends) on x86.
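For what it's worth, here is a minimal sketch of what a prefetch hint looks like from C++. It assumes GCC or Clang, where `__builtin_prefetch` is a compiler extension that usually turns into one of the x86 prefetch instructions. The function name `sum_with_prefetch` and the distance of 64 elements are made up for illustration, not tuned values.

```cpp
#include <cassert>
#include <cstddef>

// Sum an array while hinting that data further ahead will be needed soon.
// __builtin_prefetch is a GCC/Clang extension; it is only a hint and the
// CPU is free to ignore it.
long sum_with_prefetch(const int* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    constexpr std::size_t kPrefetchDistance = 64;  // elements ahead (assumed value)
    long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (i + kPrefetchDistance < count) {
            // Arguments: address, 0 = read access, 3 = high temporal locality.
            __builtin_prefetch(&data[i + kPrefetchDistance], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

On a plain sequential walk like this the hardware prefetcher will probably do just as well on its own; explicit hints tend to matter more for access patterns the hardware can't guess.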
3. Instead of relying on branch
prediction, why not take both branches, and provide
the ability to swap pipelines to use the confirmed
branch? This dual pipeline could be used for extra
execution when branching was not occurring.
A lot of the branch instructions your CPU pipeline sees are in loops and branch the same way many times in a row. Running both sides of those would actually be slower, because the rarely executed path would be competing for execution resources. Here's a random C++ example:
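The listing itself isn't included here, so the following is only a guess at its shape: a loop that runs 200 times with a flag test in the middle. Only `i<200` and `mVolumeFade` come from the post; the `Mixer` struct, `mVolume`, `mFadeStep` and the buffer arguments are hypothetical.

```cpp
#include <cassert>

// Hypothetical audio mixer. Only `i < 200` and `mVolumeFade` are from the post.
struct Mixer {
    bool  mVolumeFade = false;   // set before the loop, constant while it runs
    float mVolume     = 1.0f;    // assumed starting gain
    float mFadeStep   = 0.001f;  // assumed per-sample gain decrement

    void mix(const float* in, float* out) {
        assert(in != nullptr && out != nullptr);
        for (int i = 0; i < 200; i++) {   // loop branch: same direction until the exit
            if (mVolumeFade) {            // same direction on every iteration
                mVolume -= mFadeStep;
            }
            out[i] = in[i] * mVolume;
        }
    }
};
```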
The branch at the end of the loop (i<200) will be taken 199 times in a row, and only skipped once. Speculatively executing the code after the loop makes no sense in this case. On top of that, the branch in the middle of the loop (if(mVolumeFade)) will either always be taken or never be taken across the 200 iterations, so speculatively executing both sides of that branch doesn't make sense either, and it's better to trust your branch predictor.
Speculatively executing both sides is generally something you'll only see in very large and complex out-of-order CPU cores, and I think it probably involves something like a 3- or 4-state branch predictor that predicts NO-JUMP/MAYBE-JUMP/YES-JUMP or NO-JUMP/MAYBE-NO-JUMP/MAYBE-YES-JUMP/YES-JUMP, with both sides only being run in the MAYBE states. I don't think it handles branches with many possible targets too well either (i.e. a jump through a function pointer, although some modern cores can predict multiple jump targets for those).
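The four-state scheme above looks the same as the textbook two-bit saturating counter, so here is a small simulation of that, run against a loop branch like the one discussed. This is a sketch only: the starting state is an assumption, and nothing here models running both sides in the MAYBE states, which is my speculation and not something this code shows.

```cpp
#include <cassert>

// Four states: 0 = NO-JUMP, 1 = MAYBE-NO-JUMP, 2 = MAYBE-YES-JUMP, 3 = YES-JUMP.
struct TwoBitPredictor {
    int state = 1;  // assumed initial state

    bool predict() const { return state >= 2; }

    // Move one step toward the observed outcome, saturating at 0 and 3.
    void update(bool taken) {
        if (taken && state < 3) ++state;
        if (!taken && state > 0) --state;
    }
};

// Count mispredictions for a branch that is taken `taken_count` times in a
// row and then falls through once (one full run of a loop).
int mispredictions_for_loop(TwoBitPredictor& p, int taken_count) {
    assert(taken_count >= 0);
    int misses = 0;
    for (int i = 0; i <= taken_count; ++i) {
        const bool taken = (i < taken_count);
        if (p.predict() != taken) ++misses;
        p.update(taken);
    }
    return misses;
}
```

With 199 taken branches, the first run of the loop mispredicts twice (the cold start and the exit), and every later run with the same predictor mispredicts only on the exit. That's a handful of wasted cycles per 200 iterations, which is hard to beat by executing both paths.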
4. A
hardware block move/fill/swap, page-based. Runs like a
background job. An instruction could be used to test
for completion by comparing a given address against
each pending job's address range.
Sounds a lot like a DMA controller - and if I'm not mistaken, the x86 DMA controller can be used for that, but I don't think it runs fast enough to get a speed gain. And it kinda competes with the CPU for memory access cycles, which isn't too good, although I guess you could make it operate in the unused cycles.
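To make the "test for completion" part concrete, here is a software model of what that instruction would check. Everything in it is hypothetical (`BlockJob`, `PendingJobs`, `is_busy`); real hardware would do the range compares in parallel rather than in a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one background move/fill job.
struct BlockJob {
    std::uintptr_t start;   // first byte of the affected range
    std::size_t    length;  // size of the range in bytes
};

class PendingJobs {
public:
    void submit(std::uintptr_t start, std::size_t length) {
        assert(length > 0);
        jobs_.push_back({start, length});
    }

    // Stand-in for the engine finishing its oldest job.
    void complete_oldest() {
        if (!jobs_.empty()) jobs_.erase(jobs_.begin());
    }

    // The proposed test instruction: is addr inside any job still in flight?
    bool is_busy(std::uintptr_t addr) const {
        for (const BlockJob& job : jobs_) {
            if (addr >= job.start && addr - job.start < job.length) return true;
        }
        return false;
    }

private:
    std::vector<BlockJob> jobs_;
};
```

A program would call something like `is_busy` before touching memory that a background job might still be writing, and spin or do other work while it returns true.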
5. Instead of
saving registers on task switch, use an array of
registers, indexed by taskid.
I'm sure I could think
up some more. As I stated before, I realize that most
of these are far-fetched, and do not really fit into
Mr. Fog's design. But, maybe one has merit, so I
present them for thought. I find the whole project
fascinating, and I hope one day that you can get a chip
to the fabrication stage! Best of luck!
One variant that's worth it, and that is used on ARM, is to have two stack pointer registers, one for USER mode and one for OS mode, and switch between them when interrupts happen, because this makes interrupt handlers faster. Fast interrupt handlers are generally nice to have because they make stuff like software-managed TLBs realistic, and they speed up OS calls. Fast task switching between user programs is a bit less important: you're going to have tons of cache misses after the switch anyways, so skipping the register save/restore doesn't buy as much. And a lot of programs can be running at the same time, which makes it unlikely that you can hold all their register sets on chip at once.
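A toy model of the banked stack pointer idea, just to show why interrupt entry gets cheap. This is a sketch and not how any particular ARM core is laid out: `ToyCpu`, the register count of 8 and the two-mode enum are all assumptions for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class Mode : std::size_t { User = 0, Os = 1 };

// General registers are shared; only the stack pointer exists once per mode.
struct ToyCpu {
    std::array<std::uint32_t, 8> regs{};     // shared registers (count assumed)
    std::array<std::uint32_t, 2> sp_bank{};  // one stack pointer per mode
    Mode mode = Mode::User;

    // "The" stack pointer is whichever banked copy the current mode selects.
    std::uint32_t& sp() {
        const std::size_t index = static_cast<std::size_t>(mode);
        assert(index < sp_bank.size());
        return sp_bank[index];
    }

    // Entering an interrupt only flips the mode; no SP is copied to memory.
    void enter_interrupt() { mode = Mode::Os; }
    void return_from_interrupt() { mode = Mode::User; }
};
```

The handler gets a valid OS stack the moment the mode flips, and the user's stack pointer is untouched when it returns. Scaling the same trick up to a full register set per task is where the on-chip storage cost starts to bite.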