Massimo wrote:
* What do you think about the L1D WT choice with
higher latency (coupled with a WCC halfway to the L2)?
Does it impact speed much for you?
In his analysis, Agner also wrote about an instruction-throughput penalty with both cores active. Instead of 4 instructions per clock, he could only measure about 3 instructions per clock on average. I speculate that this is an effect of the L1's WT (write-through) strategy. Because of WT, every store has to be sent on to the L2, but the L2 can probably only handle *one* store per clock, not two. Thus only 3 instructions per clock instead of 4 per module. Agner also reported a maximum of ~3.6-3.7 instructions per clock. Maybe that code had more loads than the usual 2:1 load-to-store ratio. But I don't know his code, so I can't say for sure, only speculate.
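To put rough numbers on that speculation, here is a toy back-of-the-envelope model, not a measurement and not Agner's method. It assumes (my simplification) that every instruction is either a load or a store, that a module can issue at most 4 instructions per clock, and that the L2 accepts at most one written-through store per clock. The function name and parameters are made up for illustration.

```python
def ipc_bound(loads_per_store, issue_width=4.0, l2_stores_per_clock=1.0):
    """Upper bound on instructions per clock for one module.

    Toy assumption: the instruction stream contains only loads and
    stores, in the ratio loads_per_store : 1. Every store is written
    through to the L2, which is assumed to accept at most
    l2_stores_per_clock stores per clock.
    """
    # Each store comes with loads_per_store loads, so one store
    # "carries" (loads_per_store + 1) instructions in total.
    instructions_per_store = loads_per_store + 1.0
    # Throughput limit imposed by the store path alone.
    store_limited = l2_stores_per_clock * instructions_per_store
    # The module cannot exceed its issue width either.
    return min(issue_width, store_limited)


if __name__ == "__main__":
    for ratio in (2.0, 2.65, 3.0):
        print(f"{ratio}:1 loads to stores -> at most {ipc_bound(ratio):.2f} instr/clock")
```

Under these assumptions a 2:1 load-to-store mix is capped at 3 instructions per clock, and a mix of roughly 2.6-2.7 loads per store would land in the 3.6-3.7 range. Real code also contains non-memory instructions, so this only shows that the numbers are consistent with the store-bottleneck idea, not that it is the actual cause.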