@Alex: No, because L1D stores are sent to a 4KB WB buffer for coalescing before going to L2; that's why L1D is WT, of course. It might be interesting to 'overflow' that buffer and see what happens. Could the fractional LS number under full load depend on a hidden latency that occurs from time to time when the WCC is forced to evict a line down to L2?
@Fellix: The BD architecture looks interesting and innovative. Would you mind sharing your detailed MiArch comparison manuals with us?
Agner, re-reading the manuals I noticed a point I had overlooked: the BD decoder has NOT evolved into a 2-1-1-1 (4-instruction) design like the IA one; it is still a 2-1-1, so the "4 instr/cycle" is actually one double-path and two single-path instructions!
Since many more BD instructions are single-mop than on Intel (including the most used ones, in my experience and analysis), wouldn't a 2-1-1-1 decoder have given it much better decode throughput than Intel? In essence, this decoder can at BEST 'pump' 1.5 instructions/cycle per core to the ALUs under full load, and at most 3 under a single-core load if no double-path instructions are encountered. Odd.
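To make the arithmetic behind those figures explicit, here is a rough Python sketch. It only encodes my reading above and is not verified hardware behavior: it assumes each decode slot accepts one instruction per cycle and that decode bandwidth is split evenly between the active cores of a module. The function names and the slot tuples are made up for illustration.

```python
from fractions import Fraction


def peak_decode_rate(slots, active_cores):
    """Peak instructions/cycle per core for a decoder shared by the active cores.

    slots: max macro-ops each decode slot can emit, e.g. (2, 1, 1).
    Assumption: one instruction per slot per cycle, bandwidth split evenly.
    """
    if active_cores < 1:
        raise ValueError("active_cores must be at least 1")
    return Fraction(len(slots), active_cores)


def peak_mop_rate(slots, active_cores):
    """Peak macro-ops/cycle per core under the same assumptions."""
    if active_cores < 1:
        raise ValueError("active_cores must be at least 1")
    return Fraction(sum(slots), active_cores)


if __name__ == "__main__":
    for name, slots in (("2-1-1", (2, 1, 1)), ("2-1-1-1", (2, 1, 1, 1))):
        for cores in (1, 2):
            print(name, "cores:", cores,
                  "instr/cycle/core:", float(peak_decode_rate(slots, cores)),
                  "mops/cycle/core:", float(peak_mop_rate(slots, cores)))
```

Under these assumptions a 2-1-1 pattern gives 3 instructions (4 mops) per cycle for one core and 1.5 instructions per core when both cores are busy, while a hypothetical 2-1-1-1 pattern would give 4 and 2.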