Hello, Agner.
Here are the latest results from the AIDA64 HW bench for Zen: users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt . We can see that it takes 0.23 cl to execute «2299 LNOP :LNOP8» (the 8-byte long NOP), which makes ~35 B/cl. However, it is not clear whether this code is fetched from L1I or from L0m (the mop cache). It is also unclear whether one decoder lane can generate 2 mops: is that possible for all 2-mop instructions, or only for AVX-256? So, can 2 mops fit in 1 mop entry of the L0m cache even if the instruction is not AVX-256? We only know that microcoded instructions are not cacheable and are read directly from mROM. And where do „fused" 2-mop instructions dissolve into 2 distinct mops?

It is important to know the topology and restrictions of L0m. A code portion from L1I (probably 32 B) generates a certain number of mops to be cached. Intel CPUs require a cached 32 B portion to contain 1-18 mops, so that it fits into 1-3 L0m lines (6 mops each), all located in a common set; and it is not possible to split a multi-mop instruction between lines. A cached portion must have only 1 entry point (at its start); jumping into its middle causes a miss and a refill, so there can be copies of the same portion with different entry points. There is also a maximum of 2 jumps per line. Zen must have similar rules, and they need to be tested. Remember: some 4 years ago you ran a test of Sandy Bridge's L0m and sent me an xls file with some interesting results; I hope you still have that code.

The eviction policy is also important. Is L0m inclusive with L1I? Is L0m flushed on a context switch? We do know that L0m is statically divided between the 2 threads. Also, L0m decreases the branch mispredict penalty (if the target address is cached) by a still-unknown amount. Is it possible to read a line for one thread and write one for another in the same cycle?

It would also be good to know the details of the 6 OoO queues (14 mops each) for the 6 GPR execution ports. How are mops distributed among them at allocation?
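For reference, the ~35 B/cl figure and the Intel 18-mop packing limit quoted above follow from simple arithmetic (a sketch; the 0.23 cl timing and the 6-mop line size are taken from the sources cited above, not measured here):

```python
# Throughput of the 8-byte long NOP from the InstLatX64 dump:
# 0.23 clocks per 8-byte instruction -> bytes per clock.
nop_len_bytes = 8
clocks_per_nop = 0.23                  # from the AIDA64/InstLatX64 dump
bytes_per_clock = nop_len_bytes / clocks_per_nop
print(f"{bytes_per_clock:.1f} B/cl")   # ~34.8 B/cl, i.e. the ~35 B/cl above

# Intel's mop-cache packing rule for one 32 B code window:
lines_per_window = 3                   # at most 3 L0m lines per 32 B portion
mops_per_line = 6                      # each line holds up to 6 mops
max_mops = lines_per_window * mops_per_line
print(max_mops)                        # 18, matching the 1-18 mop limit
```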
Then, knowing that an FMA uses 3 inputs, borrowing 1 read port „aside", how many vector reads can be made per clock with both FMA units loaded?
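The demand side of that question can at least be written down (a sketch; the actual read-port count of Zen's vector register file is exactly the open question, so only the FMA-side demand is computed here):

```python
# Vector-register read ports demanded per clock when both FMA pipes issue.
# a*b + c needs 3 source operands, so each FMA consumes 3 reads.
fma_units = 2
inputs_per_fma = 3
reads_per_clock = fma_units * inputs_per_fma
print(reads_per_clock)  # 6 reads/clock demanded by the two FMAs alone;
                        # whatever ports remain would serve other vector ops
```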