Hi, thanks - I see now what you mean. Four tagged instructions go to decoders 0-3; if decoder 0 gets a 2-MOP instruction, then decoder 3's resources are used/stalled. So it's likely that decoder 3 shares a bus with decoder 0 for outputting MOPs to the OOOE scheduler.
Still, I cannot understand the huge IPC penalty of BD relative to SB. The LS is almost the same since Nehalem (2R/1R1W). BD has a VERY slow REP MOVS (1/3 the speed of SB, which sounds very worrying if you consider that mem/str/array copies are still implemented with rep movx), but that alone cannot account for the performance loss. The lack of a 3rd ALU is important, yet the OOOE could easily mix MOV and other instructions in between - for full(?) throughput I need to schedule asm instructions manually on IA.
So, while CMT is surely slowed down a lot by the decoder - what do you think of the single-core performance pitfall, even with regard to the K10 architecture? The BD decoder/retirement seems better than K10's (max 4 MOPs), yet it lags behind.