And each program will be built by
people with varying levels of skill. People who will
not respect the beauty behind the memory
organization
scheme. And, you have to cater to them, because if
their program brings the system to its knees, your
chip and the OS will be blamed.
In other words,
some
crappy app should be able to disrespect the entire
memory space...and your chip and OS should handle it
with some grace, I believe.
Some performance critical programs are
written by highly skilled programmers who want to
tweak as much performance out of the system as
possible. Depending on the application, they may be
able to predict how much memory they will need and
allocate it all at once. Such applications would only
need a handful of memory blocks which could easily be
handled by an on-chip memory map. Other programs,
probably less critical, are made with point-and-click
tools and abstract frameworks by people who have no
idea what is going on behind the scenes. They may
cause heavy memory fragmentation without even knowing
it. Maybe we can have a dual system with an on-chip
memory map for well-behaved applications and a
partially software-based paging system which will be
activated only if the memory becomes too fragmented.
Programmers who make well-behaved applications will be
rewarded with superior performance.
Believe me, I like the idea of the super-fast memory map, and the dual system idea is interesting. I appreciate a system that remembers to give the programmer the keys to the kingdom, so to speak, because most systems provide a dumbed-down, common-denominator approach. But a dual system? That makes the system that much more complicated, doesn't it? Maybe OS drivers and OS-protected low-level stuff could use the on-chip map exclusively. Otherwise, I'll be pushing to have my Super Zombie Crusher 3 shove OS stuff out so my 100-fps renderer can live on-chip, right?! What programmer would choose the slow model? I don't know - the more I think about it, the more I lean towards the traditional model. Yes, it's slow and complex. But that's because it's necessary to get the power you want and need. I think Intel got it right, somewhat...it's hard to deny. Simplifying it upfront leads to more complex schemes down the road. To me it seems inevitable. Now, please reconsider on-chip mass memory, which could alleviate this issue and others - see below.
Back to the subject of dreaming about my
ideal CPU: These are not well thought out, or even
practical, most likely. But they'd be nice. Here
goes:
1. 16 GB of memory on the same chip as the
CPU. I
don't know what the limitations are, but if all
your
PC's memory could be super-fast and on-chip, wow!
Memory
wait states slow down the CPU a lot, and if it was
all on the same chip, you could eliminate all the
complex nightmare caching hardware.
2. If not #1,
then some really good cache hint/directive
instructions.
3. Instead of relying on branch
prediction, why not take both branches, and provide
the ability to swap pipelines to use the confirmed
branch? This dual pipeline could be used for extra
execution when branching was not occurring.
4. A
hardware block move/fill/swap, page-based. Runs
like a
background job. An instruction could be used to test
for completion by comparing a given address against
each pending job's address range.
5. Instead of
saving registers on task switch, use an array of
registers, indexed by taskid.
I'm sure I could
think
up some more. As I stated before, I realize that
most
of these are far-fetched, and do not really fit into
Mr. Fog's design.
These ideas are not far-fetched, and some
have already been implemented in various systems.
Swapping register banks was even supported in some
processors back in the 1980s. Putting RAM on the chip
is an obvious thing to do for the less memory-hungry
applications. The more RAM you put on the chip the
slower it will be, so you need one or more levels of
cache in between anyway.
On the subject of massive on-chip memory: It's slow. Ok. But how slow? I really want to see the numbers on this one. Maybe it requires a new technology. But let's assume it's possible. Think about it. Low, fixed wait states. Zero cache misses. No complicated cache hardware - this is a big one. Or that pipeline-burst cache hardware could instead force-feed the execution pipeline with instructions. Cache misses happen a lot. And surely it's going to be faster than external memory. Is it expensive, money-wise? Is that why we don't see it being attempted? Maybe you could spread the memory across multiple cores.
Some algorithms simply cannot avoid reading memory "vertically", such as 90-degree block rotation algorithms. These functions can suffer a cache miss on each read unless proper cache hints are provided...yielding a needlessly slow read/write.
To me, cache misses and branch misprediction are the two things that prevent us from optimizing our code accurately, because we cannot determine how much time each instruction will take, and therefore cannot pair instructions perfectly. Knowing exact memory timing would allow the pipeline to be finely tuned, I'd imagine. Removing the cache hardware frees up fast chip real estate for RAM, and simplifies the design enough to justify research into making it possible.
And, your memory map could sit in this same main memory too.