Author: Hubert Lamontagne
Date: 2016-08-09 17:49
Joe Duarte wrote:
I'm thinking of approaches like Parabix:
http://parabix.costar.sfu.ca/
One of their papers:
https://www.cs.sfu.ca/~ashriram/publications/2012_HPCA_Parabix.pdf
I guess it just comes down to vector instructions and their parallel bitstreams approach. (Another way to boost it would be multiplexed streams, but that's a different plane of the architecture than the ISA.)
Hmm, that's an interesting technique. It's very reminiscent of how they used to do planar graphics on the Amiga and in EGA (which worked well for 2D but had to be abandoned once they started doing 3D).
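To make the parallel-bitstream idea a bit more concrete, here's a rough sketch (my own illustration, not actual Parabix code; transpose64 and match_lt are made-up names): the input bytes get transposed into 8 bit-streams, one per bit position, so a character-class test like "is this byte a '<'?" becomes a handful of bitwise operations covering 64 byte positions per 64-bit word. Real implementations do the transposition with SIMD shuffles; the scalar loop here just shows the data layout.

    /* Rough sketch of the parallel-bitstream idea (illustration only, not
     * Parabix code; transpose64 and match_lt are made-up names). */
    #include <stdint.h>

    /* Bit i of streams[b] holds bit b of bytes[i]. Real implementations use
     * SIMD shuffle/movemask tricks; this scalar loop just shows the layout. */
    static void transpose64(const uint8_t bytes[64], uint64_t streams[8]) {
        for (int b = 0; b < 8; b++) {
            uint64_t s = 0;
            for (int i = 0; i < 64; i++)
                s |= (uint64_t)((bytes[i] >> b) & 1) << i;
            streams[b] = s;
        }
    }

    /* Mark every position holding '<' (0x3C = 00111100): 64 byte positions
     * are tested with a handful of ANDs and NOTs instead of 64 compares. */
    static uint64_t match_lt(const uint64_t s[8]) {
        return ~s[7] & ~s[6] & s[5] & s[4] & s[3] & s[2] & ~s[1] & ~s[0];
    }

The planar-graphics analogy is exactly this layout: each bit position of the data lives in its own plane, and whole-word logic operations act on many elements at once.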
Joe Duarte wrote:
Hubert wrote:
That's a pretty high target! To get anything close to this performance, you'd need to run ForwardCom at about 4 instructions per cycle, and considering the complexity of ForwardCom instructions, this would require a pretty large and complex CPU core. My hunch is that it would run at about the same speed as x86, but simply require less design time because it avoids all the crazy instruction unpacking and 286-era segmenting intricacies. I don't think ForwardCom would scale very well past 4 instructions per cycle, but to be fair I don't think any instruction set does (they tried to do it with Itanium and failed).
Well, Broadwell-E is a high target right now, but the target is moving. It's going to be a low target in 2020. Of course Agner's not necessarily thinking in terms of marketable silicon, but more toward a useful reference and engineering exercise. I'm just wondering what kind of performance wins are possible. Another way of framing it: If a well-funded team rolled up their sleeves and built a new general purpose CPU architecture from scratch, including a new ISA and tape out and all that, using TSMC or Samsung's 16/14 nm process, could they beat Intel/Skylake? Maybe Intel is wringing all the performance out of current processes that the physics allows, and the legacy technical debt is only a design cost overhead. They seem to be treading water since Haswell, so I sometimes wonder if a better architecture would make a difference or if the wall is more fundamental than Intel.
That's a good question. I think that's exactly what ARM is trying to do! In fact, there are a whole bunch of ARM licensees that are taking a stab at this, and imho it's proving to be a little bit harder than expected...
The catch is that once you have a CPU as big as a Broadwell, it has tons of stuff in it, in particular a very complex memory architecture with a whole laundry list of features: an aggressively pipelined multibank data cache with 2 read ports and 1 write port, TLBs, out-of-order non-blocking everything with a memory ordering buffer, prefetching, a multi-core cache coherency protocol, handling of misaligned addresses... This is almost totally independent of the instruction set: you could implement all the same stuff on an ARM or MIPS or POWER core and it would be just as complex.
If you factor in the other stuff, like aggressively pipelined secondary units such as the FPU and SIMD unit, heavy branch prediction, hyper-threading... it's all basically the same on other architectures too. The only thing x86 really changes is the front end, and the front-end part of a CPU isn't THAT large and complex compared to all the rest on modern cores with so many features.
And if, say, some new silicon process changes the distribution of bottlenecks so that faster instruction decoding becomes more important again, then Intel can simply put in a larger or wider micro-op cache, or maybe even go back to an instruction trace cache like on the Pentium 4. (OK, the P4 is a bit reviled, but the truth is that it still held up pretty well compared to the other well-designed architectures of the time.)
Joe Duarte wrote:
By the way, do you think Itanium was just too early, with too little tooling? Does an out-of-order CISC have any inherent advantage over VLIW or EPIC? It seems goofy to send one instruction at a time. I haven't found any deep-dive post-mortems on Itanium, just vague claims that no one could build a good compiler for it, or that the physical product just wasn't that fast. I think we're going to see some big advances in compiler technology in the near future, with powerful platforms like IBM Watson doing things a laptop-bound compiler can't do.
I think it boils down to a couple of things... The first-generation Itanium came out late. Because of this, the RISCs and x86s of the day had an extra generation to catch up, which meant that the first-generation Itanium totally failed to impress. The second generation came out pretty fast and sorta fixed this, but Itanium could never quite catch up to the speed of the fastest RISCs and x86s. One guy said that it was never as fast as HP's previous architecture (PA-RISC, totally a classic RISC).
Second thing, Itanium was basically betting that out-of-order would fail to deliver enough speed gains to keep the RISCs and x86s relevant. It turns out that out-of-order delivers speed gains at just the right place: memory accesses. Real life programs have lots of memory accesses going on, which gives out-of-order a fairly hard-to-beat edge. It's just easier to deal with memory latency with an out-of-order pipeline than with the Itanium's insane ALAT thing with checked speculative loads and stores and whatnot. Late Itaniums can issue 12 instructions per cycle, which x86 will probably never be able to do, but even though x86 only issues 4 instructions per cycle, 3 of those can have a built-in memory access which can be crazy-reordered by the aggressive out-of-order pipeline, and it turns out that that's enough to smoke the Itanium.
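To illustrate what I mean (a made-up example, not from any benchmark): in memory-bound code like the loop below, the loads in each iteration are independent of the loads in other iterations, so an out-of-order core can keep a dozen cache misses in flight at once as soon as the index values arrive, while a statically scheduled VLIW/EPIC machine has to guess the load latencies at compile time or lean on checked speculation like the ALAT.

    /* Made-up illustration of a typical memory-bound loop. On x86 the body
     * typically compiles to an add with a memory operand feeding off a prior
     * load, and the out-of-order engine overlaps those loads across
     * iterations on its own. */
    #include <stddef.h>
    #include <stdint.h>

    uint64_t gather_sum(const uint32_t *idx, const uint64_t *data, size_t n) {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += data[idx[i]];  /* two dependent loads per iteration,
                                     but independent across iterations */
        return sum;
    }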
There's a question on Quora about exactly this and it has tons of interesting answers:
Why did the Intel Itanium microprocessors fail?