Agner's CPU blog


Test results for AMD Ryzen
Author: Tacit Murky Date: 2017-05-05 12:24
Hello, Agner.
Here are the latest results from the AIDA64 HW-bench for Zen: users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt . We can see that it takes 0.23 cl to execute «2299 LNOP :LNOP8» (an 8-byte long NOP), which works out to ~35 B/cl. However, it's not clear whether this code is delivered from L1I or from L0m (the mop cache). It is also unclear how 1 decoder lane generates 2 mops: is that possible for all 2-mop instructions, or only for AVX-256? In other words, can 2 mops fit in 1 mop entry of the L0m cache even if the instruction is not AVX-256? We only know that microcoded instructions are not cacheable and are read directly from the mROM. Where do „fused“ 2-mop instructions split into 2 distinct mops?
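For reference, the ~35 B/cl figure follows directly from the measured reciprocal throughput. The short Python sketch below just redoes that arithmetic; the function name is made up for illustration, and the only inputs are the two numbers quoted above (8 bytes, 0.23 cl).

```python
def bytes_per_clock(instr_bytes: int, clocks_per_instr: float) -> float:
    """Code bytes consumed per clock for a stream of identical instructions."""
    return instr_bytes / clocks_per_instr


# Values quoted from the InstLatX64 listing: 8-byte NOP, 0.23 cl each.
nop8_rate = bytes_per_clock(8, 0.23)
nops_per_clock = 1 / 0.23

print(f"{nop8_rate:.1f} B/cl, {nops_per_clock:.2f} NOPs/cl")
```

Since 0.23 cl is itself a rounded figure, the result should be read as approximate.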

It is important to know the topology and restrictions of L0m. A portion of code from L1I (probably 32 B) generates a certain number of mops to be cached. An Intel CPU requires a cached 32 B portion to contain 1-18 mops, so that it fits into 1-3 L0m lines (6 mops each), all located in a common set. A multi-mop instruction cannot be split between lines. A cached portion must have only 1 entry point (at its start); jumping into the middle causes a miss and a refill, so there can be several copies of the same portion with different entry points. There is also a maximum of 2 jumps per line. Zen must have similar rules, and they need to be tested. Remember: some 4 years ago you tested Sandy Bridge's L0m and sent me an xls file with some interesting results. I hope you still have that code.
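To make the Intel rules listed above concrete, here is a rough Python model of how a 32 B portion might be packed into L0m lines. It is only a sketch of the constraints as stated in this post (6 mops per line, at most 3 lines, no instruction split across lines, at most 2 jumps per line); the `(mops, is_jump)` encoding and the function name are my own invention, it ignores entry points and set mapping, and it simply rejects any instruction with more than 6 mops.

```python
from typing import List, Optional, Tuple

LINE_MOPS = 6   # mops per L0m line, as stated in the post
MAX_LINES = 3   # lines available to one 32 B portion, as stated in the post
MAX_JUMPS = 2   # jumps allowed per line, as stated in the post

Instr = Tuple[int, bool]  # (mop count, is_jump): hypothetical encoding


def pack_portion(instrs: List[Instr]) -> Optional[List[List[Instr]]]:
    """Pack instructions in program order into lines.

    Returns the list of lines, or None if the portion is not cacheable
    under the simplified rules above.
    """
    lines: List[List[Instr]] = [[]]
    used = 0    # mops used in the current line
    jumps = 0   # jumps placed in the current line
    for mops, is_jump in instrs:
        if mops > LINE_MOPS:
            return None  # simplification: cannot fit in any single line
        # Start a new line rather than split an instruction or exceed 2 jumps.
        if used + mops > LINE_MOPS or (is_jump and jumps == MAX_JUMPS):
            lines.append([])
            used = 0
            jumps = 0
        lines[-1].append((mops, is_jump))
        used += mops
        jumps += is_jump
    if not instrs or len(lines) > MAX_LINES:
        return None
    return lines


# 18 single-mop instructions fill exactly 3 lines.
full = pack_portion([(1, False)] * 18)
# Four 4-mop instructions are only 16 mops, but none can share a line.
too_fragmented = pack_portion([(4, False)] * 4)
# A third jump forces a second line even though mops would still fit.
three_jumps = pack_portion([(1, True)] * 3)
```

The second example is the interesting one: the total is below 18 mops, yet the portion does not fit because of the no-split rule. A real test on Zen would have to find out which of these limits, if any, apply there.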

The eviction policy is also important. Is L0m inclusive with L1I? Is L0m flushed on a context switch? We do know that L0m is statically divided between 2 threads. L0m also decreases the branch mispredict penalty (if the target address is cached), by an as yet unknown amount. Is it possible to read a line for one thread and write a line for another in the same cycle?

It would be good to know the details of the 6 OoO queues (14 mops each) for the 6 GPR execution ports. How are mops distributed among them on allocation? Also, given that FMAs use 3 inputs and borrow 1 read port „aside“, how many vector reads can be made per clock when both FMAs are loaded?

 