And each program will be built by
people with varying levels of skill. People who will
not respect the beauty behind the memory
organization
scheme. And, you have to cater to them, because if
their program brings the system to its knees, your
chip and the OS will be blamed.
In other words,
some
crappy app should be able to disrespect the entire
memory space...and your chip and OS should handle it
with some grace, I believe.
Some performance critical programs are
written by highly skilled programmers who want to
tweak as much performance out of the system as
possible. Depending on the application, they may be
able to predict how much memory they will need and
allocate it all at once. Such applications would only
need a handful of memory blocks which could easily be
handled by an on-chip memory map. Other programs,
probably less critical, are made with point-and-click
tools and abstract frameworks by people who have no
idea what is going on behind the scenes. They may
cause heavy memory fragmentation without even knowing
it. Maybe we can have a dual system with an on-chip
memory map for well-behaved applications and a
partially software-based paging system which will be
activated only if the memory becomes too fragmented.
Programmers who make well-behaved applications will be
rewarded with superior performance.
Believe me, I like the idea of the super-fast memory map, and the dual system idea is interesting. I appreciate a system that remembers to give the programmer the keys to the kingdom, so to speak, because most systems provide a dumbed-down, common-denominator approach. But a dual system? That makes the system that much more complicated, doesn't it? Maybe OS drivers and OS-protected low-level stuff could use the on-chip map exclusively. Otherwise, I'll be pushing to have my Super Zombie Crusher 3 shove OS stuff out so my 100-fps renderer can live on-chip, right?! What programmer would choose the slow model? I don't know - the more I think about it, the more I lean towards the traditional model. Yes, it's slow and complex. But that's because it's necessary to get the power you want and need. I think Intel got it right, somewhat...it's hard to deny. Simplifying it upfront leads to more complex schemes down the road. To me it seems inevitable. Now, please reconsider on-chip mass memory, which could alleviate this issue and others - see below.
Back to the subject of dreaming about my
ideal CPU: These are not well thought out, or even
practical, most likely. But they'd be nice. Here
goes:
1. 16 GB of memory on the same chip as the
CPU. I
don't know what the limitations are, but if all
your
PC's memory could be super-fast and on-chip, wow!
Memory
wait states slow down the CPU a lot, and if it was
all on the same chip, you could eliminate all the
complex nightmare caching hardware.
2. If not #1,
then some really good cache hint/directive
instructions.
3. Instead of relying on branch
prediction, why not take both branches, and provide
the ability to swap pipelines to use the confirmed
branch? This dual pipeline could be used for extra
execution when branching was not occurring.
4. A
hardware block move/fill/swap, page-based. Runs
like a
background job. An instruction could be used to test
for completion by comparing a given address against
each pending job's address range.
5. Instead of
saving registers on task switch, use an array of
registers, indexed by taskid.
I'm sure I could
think
up some more. As I stated before, I realize that
most
of these are far-fetched, and do not really fit into
Mr. Fog's design.
These ideas are not far-fetched, and some
have already been implemented in various systems.
Swapping register banks was even supported in some
processors back in the 1980s. Putting RAM on the chip
is an obvious thing to do for the less memory-hungry
applications. The more RAM you put on the chip the
slower it will be, so you need one or more levels of
cache in between anyway.
On the subject of massive on-chip memory: It's slow. Ok. But how slow? I really want to see the numbers on this one. Maybe it requires a new technology. But let's assume it's possible. Think about it. Low, fixed wait states. Zero cache misses. No complicated cache hardware - this is a big one. Or that pipeline-burst cache hardware could instead force-feed the execution pipeline with instructions. Cache misses happen a lot. And surely it's going to be faster than external memory. Is it expensive, money-wise? Is that why we don't see it being attempted? Maybe you could spread the memory across multiple cores.
Some algorithms simply cannot avoid reading memory "vertically", such as 90-degree block rotation algorithms. These functions can suffer a cache miss on each read unless proper cache hints are provided...yielding a needlessly slow read/write.
To me, cache misses and branch misprediction are the two things that prevent us from optimizing our code accurately, because we cannot determine how much time each instruction will take, and therefore cannot pair instructions perfectly. Knowing exact memory timing would allow the pipeline to be finely tuned, I'd imagine. Removing the cache hardware frees up fast chip real estate for RAM, and simplifies the design enough to justify research into making it possible.
And, your memory map could sit in this same main memory too.