AMD Ryzen 5800
Posted: 2021-01-31, 16:22:39
I have now tested the AMD Zen 3 (Ryzen 5800) architecture.
The Zen 1 design from AMD was quite successful, with substantial improvements over previous models. Zen 2 made significant improvements over Zen 1, and Zen 3 now turns out to be faster still. There are more execution units and several other improvements in Zen 3. AMD's claims about improved performance are basically confirmed by my tests. See link.
The throughput of the Zen 3 is now as high as six instructions per clock cycle. This may be six integer instructions or six floating point/vector instructions, or any mix of these. This is a record so far. It can do three memory operations per clock. The clock frequency is 3.8 GHz with boosts up to almost 5 GHz.
A serious bottleneck is a decoding rate of 4 instructions or 16 bytes per clock. To compensate for this, the Zen 3 has a micro-op cache with 4096 entries after the decoder.
The increased throughput in terms of instructions per clock may be difficult to utilize if the software has long dependency chains (where each calculation must wait for the result of the preceding one). It is now more important than ever to avoid long dependency chains.
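As a minimal sketch of what breaking up a dependency chain can look like, the two C functions below sum the same array. The function names and the choice of four accumulators are illustrative assumptions, not values tuned for the Zen 3.

```c
#include <stddef.h>

/* One accumulator: every addition must wait for the result of the
   previous one, so the loop forms a single long dependency chain. */
double sum_single_chain(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

/* Four independent accumulators: four shorter chains that an
   out-of-order CPU can overlap. The number four is an assumption
   made for illustration only. */
double sum_four_chains(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) {
        s0 += a[i];               /* leftover elements */
    }
    return (s0 + s1) + (s2 + s3); /* combine the partial sums */
}
```

Floating point addition is not associative, so the two functions may differ in the last bits of the result. For the same reason, a compiler will normally not make this transformation on its own unless it is explicitly allowed to reassociate floating point operations.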
The bottleneck in the decoder appears to be difficult to overcome. This is a consequence of the messy x86 code structure where instructions can have any length from 1 to 15 bytes, and it is complicated to determine the length of each instruction. Intel processors have the same bottleneck and the same decoding rate. The programmer must make sure that the critical part of a program fits into the micro-op cache in order to get the maximum throughput. It is important to avoid loop unrolling where possible in order to economize the use of the micro-op cache. (The Clang compiler often unrolls loops excessively.)
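One way to keep a compiler from unrolling a particular loop is a loop pragma. The sketch below assumes Clang or GCC 8 or later; the macro name NO_UNROLL and the function scale_array are hypothetical names chosen for illustration. The pragma is only a request to the compiler, so it is worth checking the generated assembly.

```c
#include <stddef.h>

/* Hypothetical helper macro: ask the compiler not to unroll the
   loop that follows. Expands to nothing on other compilers. */
#if defined(__clang__)
#define NO_UNROLL _Pragma("clang loop unroll(disable)")
#elif defined(__GNUC__) && __GNUC__ >= 8
#define NO_UNROLL _Pragma("GCC unroll 1")
#else
#define NO_UNROLL
#endif

/* Multiply every element of x by factor, keeping the loop body
   small so that it takes up less space in the micro-op cache. */
void scale_array(float *x, size_t n, float factor) {
    NO_UNROLL
    for (size_t i = 0; i < n; i++) {
        x[i] *= factor;
    }
}
```

The pragma applies only to the loop that immediately follows it, so it can be limited to the loops in the critical part of the program.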
The AMD Zen 3 has a higher instruction-per-clock throughput and a bigger micro-op cache than the best current Intel processors. This makes the Zen 3 the best choice for many applications. The Zen 3 does not support the AVX512 instruction set, however. Therefore, Intel processors are likely to be faster for software that can utilize the 512-bit vector instructions. AMD have focused on higher throughput where Intel have focused on larger vectors.
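Software that has to run well on both kinds of processors can check for AVX512 at run time and pick a code path accordingly. The following is a minimal sketch, assuming GCC or Clang on an x86 target, where the builtin __builtin_cpu_supports is available. The function names has_avx512f and choose_vector_bits are hypothetical, and the 256-bit fallback assumes that AVX2 is present, as it is on the Zen 3.

```c
/* Returns 1 if the CPU reports the AVX512F feature, 0 otherwise.
   On other compilers or targets this sketch simply returns 0. */
int has_avx512f(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") ? 1 : 0;
#else
    return 0;
#endif
}

/* Hypothetical dispatcher: choose the vector width in bits for the
   code path to run. Assumes AVX2 is available as the fallback. */
int choose_vector_bits(void) {
    return has_avx512f() ? 512 : 256;
}
```

A real program would typically do this check once at startup and store a function pointer to the selected version of the critical code.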
The Zen 2 had the surprising feature that it could mirror memory operands inside the CPU, as I have described here. The Zen 3 does not have this feature. It is no doubt costly in terms of hardware complexity and temporary registers, and it is likely to be more useful in 32-bit mode than in 64-bit mode. Therefore, it makes sense to prioritize the hardware resources for other improvements.
I have made a detailed description of the Zen 3 architecture in my microarchitecture manual and my list of instruction timings (link).