Surprising new feature in AMD Ryzen 3000

News and research about CPU microarchitecture and software optimization
agner
Site Admin
Posts: 76
Joined: 2019-12-27, 18:56:25

Surprising new feature in AMD Ryzen 3000

Post by agner » 2020-08-27, 14:27:09

I have just finished testing the AMD Zen 2 CPU. The results are in my microarchitecture manual and my instruction tables https://www.agner.org/optimize/#manuals.

I discovered that the Zen 2 has a surprising new feature that we have not seen before: it can mirror the value of a memory operand inside the CPU so that the operand can be accessed with zero latency.

This assembly code shows an example:

Code:

mov dword [rsi], eax
add dword [rsi], 5
mov ebx, dword [rsi]

When the CPU recognizes that the address [rsi] is the same in all three instructions, it will mirror the value at this address in a temporary internal register. The three instructions are executed in just 2 clock cycles, where they would otherwise take 15 clock cycles.

It can even track an address on the stack while compensating for changes in the stack pointer across push, pop, call, and return instructions. This is useful in 32-bit mode where function parameters are pushed on the stack. A simple function can read its parameters without waiting for the values to be stored on the stack and read back again. This does not work if the stack pointer is modified by any other instructions or copied to a frame pointer. Therefore, it doesn't work with functions that set up a stack frame.
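
As a minimal 32-bit sketch of the parameter case (the function name is made up, and the callee sets up no stack frame):

Code:

; caller
push eax            ; parameter is pushed on the stack
call myfunc
add  esp, 4         ; caller removes the parameter again

; callee without a stack frame
myfunc:
mov  eax, [esp+4]   ; read the parameter; [esp] holds the return address
ret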

The mechanism works only under certain conditions. The instructions must use general purpose registers, and the operand size must be 32 or 64 bits. The memory operand must use a base pointer register and optionally an index register. It does not work with absolute or rip-relative addresses.
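
A few examples of addressing forms (the data label is made up for illustration):

Code:

; forms that can be mirrored: 32- or 64-bit general purpose registers,
; address formed from a base pointer and optionally an index
mov dword [rsi], eax
mov ebx, dword [rsi]
mov rax, qword [rsi+rcx*8]

; forms that are not mirrored
mov ax, word [rsi]          ; 16-bit operand size
movd xmm0, [rsi]            ; vector register, not a general purpose register
mov eax, [rel mydata]       ; rip-relative address (mydata is a made-up label)
mov eax, [0x1000]           ; absolute address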

It seems that the CPU makes assumptions about whether memory operands have the same address before the addresses have been calculated. This may cause problems in the case of pointer aliasing. If the second instruction in the above example uses a different pointer register holding the same value, the CPU assumes that the addresses are different, so the value of eax is forwarded directly to ebx without adding 5. It takes 40 clock cycles to undo the mistake and redo the correct calculation.
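
A contrived example of the aliasing case, using the same code as above with an extra copy of the pointer:

Code:

mov rdi, rsi           ; rdi now holds the same address as rsi
mov dword [rsi], eax
add dword [rdi], 5     ; same address through a different pointer register
mov ebx, dword [rsi]   ; eax is speculatively forwarded to ebx without the +5;
                       ; it takes about 40 clock cycles to recover from this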

Yet, this is a pretty amazing feature. Imagine how complicated it is to implement this in hardware without adding any latency. I wonder why this feature is not mentioned in any AMD documents or promotional material. At least, I can't find any mention of it anywhere. AMD has something they call superforwarding, but this must be something else because it applies only to floating point registers.

Other interesting results for the Zen 2:
The vector execution units and data paths are all extended from 128 bits to 256 bits. A typical 256-bit AVX instruction is executed with a single micro-op, while the Zen 1 would split it into two 128-bit micro-ops. The throughput for 256-bit vector instructions is now as high as two floating point vector additions and two multiplications per clock cycle.

There is also an extra memory AGU so that it can do two 256-bit memory reads and one 256-bit write per clock cycle.
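
As an illustration, a simple loop like the one below (register assignments and the label are my own, not taken from the test code) stays within these per-clock limits. Each iteration does two 256-bit loads, one 256-bit store, one floating point multiplication, and one addition:

Code:

vloop:
    vmovups ymm0, [rsi+rax]        ; first 256-bit load
    vmulps  ymm0, ymm0, ymm4       ; 256-bit multiply; ymm4 holds a constant factor
    vaddps  ymm0, ymm0, [rdi+rax]  ; 256-bit add with the second load folded in
    vmovups [rdx+rax], ymm0        ; 256-bit store
    add     rax, 32
    cmp     rax, rcx
    jb      vloop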

The maximum overall throughput for a mix of integer and vector instructions is five instructions or six micro-ops per clock for loops that fit into the micro-op cache. Long loops that don't fit into the micro-op cache are limited by a fetch rate of up to 16 bytes or four instructions per clock. Intel processors have a similar limitation, and this is a very likely bottleneck for CPU intensive code.

All caches are big, the clock frequency is high, and you can get up to 64 cores. All in all, this is quite a competitive CPU as long as your software does not utilize the AVX512 instruction set. The software market is generally slow to adopt new instruction sets, so I guess it makes economic sense for AMD to lag behind Intel in the race for new instruction sets and longer vector registers.

Nildawenn
Posts: 1
Joined: 2020-08-28, 13:32:52

Re: Surprising new feature in AMD Ryzen 3000

Post by Nildawenn » 2020-08-28, 13:46:26

With your code example, could this be used effectively by a compiler to make the CPU ensure an address is in the cache by the time a final store comes about?

For example, if you intentionally use the destination address as temporary storage (even if unnecessary), does it begin fetching the address on first reference, or will the address be brought into L1 only when the core sees a need to write back to cache (perhaps on a final store in the sequence that can be mirrored)?

Do you know how many addresses can be simultaneously mirrored like this?

agner
Site Admin
Posts: 76
Joined: 2019-12-27, 18:56:25

Re: Surprising new feature in AMD Ryzen 3000

Post by agner » 2020-08-28, 15:34:49

Nildawenn, you don't need this feature to load a cache line. Any read, write, or prefetch to this address will do. It is not advisable to rely on an undocumented feature that may be different in the next processor.

Nildawenn wrote: "Do you know how many addresses can be simultaneously mirrored like this?"

I cannot find a limit. Writing to hundreds of different addresses in between doesn't make the problem of pointer aliasing go away, even if the number of addresses exceeds the number of internal temporary registers.

jtsmith
Posts: 3
Joined: 2020-08-28, 19:27:27

Re: Surprising new feature in AMD Ryzen 3000

Post by jtsmith » 2020-08-28, 19:34:46

Although I had to double-check the equivalence of the test cases, this "memory renaming" was noticed and examined earlier this year by some of the usual suspects:

https://gist.github.com/travisdowns/bc9 ... 82f85b9e9c
https://pvk.ca/Blog/2020/02/01/too-much ... orwarding/

Both Zen 2 and Ice Lake have been verified as handling bypass transformations for at least isolated store/load pairs; the store / increment-at-address / load test case is apparently equivalent to just two back-to-back instances of such a pair.

dt_cpu
Posts: 2
Joined: 2020-08-29, 4:04:55

Re: Surprising new feature in AMD Ryzen 3000

Post by dt_cpu » 2020-08-29, 4:42:03

Is there evidence that the core can handle stack manipulations (push & pop) and still forward values directly? Given the presence of the stack engine in the front end, I'm sorta expecting the two features to play together nicely.

agner
Site Admin
Posts: 76
Joined: 2019-12-27, 18:56:25

Re: Surprising new feature in AMD Ryzen 3000

Post by agner » 2020-08-29, 6:27:59

jtsmith: Thanks for the references. This fits nicely with my findings. It is amazing that this works with zero latency even with read-modify-write instructions and stack instructions. Too bad, though, that the penalty is so high if it fails.

dt_cpu: Yes, the mechanism seems to work perfectly together with the stack engine. For example: push eax / call function / mov eax,[esp+4]. This scenario often occurs in 32-bit code.

jtsmith
Posts: 3
Joined: 2020-08-28, 19:27:27

Re: Surprising new feature in AMD Ryzen 3000

Post by jtsmith » 2020-08-29, 21:48:49

I don't think there is anything that can be done about the speculation squash penalty, but I'm skeptical that compilers would emit many naturally-aliasing writes due to register allocation conservatism alone. Additionally, this renaming doesn't make much sense to employ for loads reasonably likely to execute more than 3 cycles after the matching store, so I'd hope there are heuristics to promptly expire any front end tracking state it needs, leaving a rather narrow window for aliasing to slide in.

As some others elsewhere have pointed out, the intervening modify is probably not anything noteworthy beyond the base case of a store-load bypass elision, since it's just a load-to-temp-reg uop followed by a new store wholly replacing any uarch state for the first store's elision handling.

dt_cpu
Posts: 2
Joined: 2020-08-29, 4:04:55

Re: Surprising new feature in AMD Ryzen 3000

Post by dt_cpu » 2020-08-29, 21:58:14

agner: Neat!
Is the throughput of pairs of stack operations (or equivalent mov patterns) still limited by the single store port? I'm wondering whether the core supports vectorization of stores to consecutive addresses. The store port certainly has enough bandwidth. This would be an orthogonal but complementary optimization, not the memfile directly.

agner
Site Admin
Posts: 76
Joined: 2019-12-27, 18:56:25

Re: Surprising new feature in AMD Ryzen 3000

Post by agner » 2020-09-02, 13:48:42

jtsmith:
I think the speculative store-to-load forwarding is active until the store has expired. The store can probably stay in the store queue for quite a while if we have speculative execution, which we have after a predicted branch or a potential floating point trap.

A sequence of consecutive read-modify-write instructions can execute at one instruction per clock. This means that we can have a long chain of speculative store-to-load forwardings.
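
For example, a contrived chain like this:

Code:

add dword [rsi], 1     ; each add reads the result of the previous add
add dword [rsi], 1     ; through the speculative forwarding
add dword [rsi], 1
add dword [rsi], 1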

dt_cpu:
The throughput is still limited by the store port to one memory write per clock.

There is no vectorization of stores to consecutive addresses.

I made some more tests showing that the offset must be divisible by 4 in 32-bit mode and by 8 in 64-bit mode. I have updated the manuals with this and a few other revisions.
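
To illustrate in 64-bit mode (the offsets are chosen arbitrarily):

Code:

mov dword [rsi+16], eax
mov ebx, dword [rsi+16]   ; offset divisible by 8: can be mirrored
mov dword [rsi+12], eax
mov ecx, dword [rsi+12]   ; offset not divisible by 8: not mirrored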

jtsmith
Posts: 3
Joined: 2020-08-28, 19:27:27

Re: Surprising new feature in AMD Ryzen 3000

Post by jtsmith » 2020-09-06, 20:35:02

What do you mean by a store expiring? Committing, draining to the cache, or being overwritten in the store buffer ring? For normal load forwarding, it's desirable to support matching as long as possible, even beyond the buffer entry already being copied to L1, to save the most bandwidth and latency possible. There is no additional resource utilization beyond the MOB/store buffer when allowing matching that late.

In contrast, the new forwarding renaming mechanism seems likely to need to hold onto a store data source physical register until a given store's potential renaming is disabled due to a subsequent write to the same address or exhaustion of SB slots or scalar registers. Keeping store source physical registers alive until the SB wraps has an upper consumption bound of 48 out of 180 physical scalar registers in Zen 2, so I am curious whether AMD artificially expires the renaming before hitting that limit.
