Agner's CPU blog

Posted: **2022-08-02, 20:53:36**

If I have understood correctly, AMD AVX-512 support is coming to relatively cheap processors: more specifically, the Zen 4 Ryzen 7000 series.

Posted: **2022-09-06, 0:31:55**

I use an Alder Lake 12900K with E-cores disabled to do my day-to-day AVX 512 development and I was super excited for Zen 4 when it was rumored to have AVX 512. Then the big disappointment came when Papermaster said during the launch event it would only split AVX 512 into half and double pump it through the AVX 256 vector engine to save power, effectively cutting performance in half. Same technique they used on Zen and Zen+ to achieve ISA compatibility without the performance benefit. Super bummed about that.

Posted: **2022-09-06, 4:31:54**

Dannotech wrote:

it would only split AVX 512 into half and double pump it through the AVX 256 vector engine to save power

This was expected. It saves silicon space. It might still give a performance benefit in the decoding stage.

I will test the performance if somebody can give me remote access to a Zen 4.

Posted: **2022-09-27, 5:53:29**

According to the Phoronix test, there is a noticeable performance increase when using AVX-512 even though it runs on 256-bit units and, unlike on Intel, it does not dramatically increase power consumption or lower clocks:
https://www.phoronix.com/review/amd-zen4-avx512

Posted: **2022-09-27, 18:27:10**

Recent Zen 4 data:

Zen 4's AVX-512 Teardown by Alexander J. Yee: https://www.mersenneforum.org/showthread.php?p=614191
Instruction latencies & CPUID dump: AMD Ryzen 9 7950X (Raphael, Zen 4) A60F12 x64
- https://github.com/InstLatx64/InstLatx6 ... LatX64.txt
- https://twitter.com/InstLatX64/status/1 ... 7366080512
Zen 3 vs Zen 4 vs Intel GoldenCove chart: https://pbs.twimg.com/media/FdmurRAX0AU ... =4096x4096
Memory mirroring is back with double rate: https://twitter.com/InstLatX64/status/1 ... 4877763584
VPMULLQ latency is just 3 clks in AMD Zen4 (vs 15 on Intel GoldenCove)
- https://officedaytime.com/simd512e/simd ... ?f=vpmullq
- https://twitter.com/InstLatX64/status/1 ... 3818055680
x64 SIMD ISA support Euler diagram: https://twitter.com/InstLatX64/status/1 ... 3133575169

Posted: **2022-11-04, 14:12:02**

I have now had access to test a Zen 4, and the results are quite good. The Zen 4 is produced with a 5 nm process making it possible to run at a clock frequency of 4.5 - 5.7 GHz. The high clock frequency combined with a bigger micro-op cache, L2, and L3 caches, improved branch instructions, and several small improvements makes the Zen 4 a very fast processor.

The support for the new AVX512 instructions is quite good, and it includes many of the extra subsets of AVX512. Here, I have to correct a common misunderstanding. The Zen 4 does not execute a 512-bit vector instruction by using a 256-bit execution unit twice, but by using two 256-bit units simultaneously. It does not split a 512-bit instruction into two 256-bit micro-operations, like the Zen 1 that splits 256-bit instructions into two 128-bit micro-operations. The Zen 4 has four 256-bit execution units. Two of these units can do floating point addition, and the other two can do floating point multiplication. All four can do integer vector addition etc. This gives a maximum throughput for 512-bit vectors of one floating point vector multiplication and one floating point vector addition, or two integer vector additions, per clock cycle. This throughput is doubled for vectors of 256 bits or less. It is still advantageous to use 512-bit instructions if the throughput is limited by instruction decoding or micro-operation queues or code cache or something else. It is rare that execution unit throughput is the bottleneck.
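The throughput figures above can be checked with a little arithmetic. The following Python sketch is only a back-of-envelope model: it takes the unit counts from the post (for example two 256-bit floating point add units) and assumes that a 512-bit instruction consumes two 256-bit unit slots. The function name and parameters are made up for illustration.

```python
from fractions import Fraction

UNIT_BITS = 256  # width of each execution unit, as stated in the post


def peak_per_cycle(num_units, vector_bits, element_bits=32):
    """Return (instructions per cycle, elements per cycle) for one operation type.

    Assumption: an instruction wider than one unit consumes
    vector_bits / UNIT_BITS unit slots; narrower ones consume one slot.
    """
    slots = max(1, vector_bits // UNIT_BITS)
    instructions = Fraction(num_units, slots)
    elements = instructions * (vector_bits // element_bits)
    return instructions, elements


for bits in (128, 256, 512):
    instructions, elements = peak_per_cycle(2, bits)  # two FP add units
    print(f"{bits}-bit FP add: {instructions} instr/cycle, {elements} floats/cycle")
```

Under these assumptions the number of floats processed per cycle is the same for 256-bit and 512-bit vectors, while the 512-bit version needs half as many instructions. That is where the savings in decoding and queue entries would come from.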

The only downside of using AVX512 is the compare instructions. The AVX512 instruction set is storing the result of a vector compare instruction in a special mask register, where earlier instruction sets use a normal vector register. The compare instructions with mask register results have longer latencies than the legacy SSE and AVX compare instructions. On the other hand, you can use a mask register to select or disable individual elements of a vector at zero cost.
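The difference between the two kinds of compare result can be illustrated with a small Python emulation. This is a sketch of the data representation only, not of any specific instruction, and the function names are hypothetical.

```python
def compare_eq_vector(a, b, element_bits=8):
    """Legacy SSE/AVX style: each result element is all ones or all zeros."""
    all_ones = (1 << element_bits) - 1
    return [all_ones if x == y else 0 for x, y in zip(a, b)]


def compare_eq_mask(a, b):
    """AVX512 style: one result bit per element, packed into a mask."""
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y:
            mask |= 1 << i  # bit i corresponds to element i
    return mask


a = [1, 2, 3, 4]
b = [1, 0, 3, 0]
print(compare_eq_vector(a, b))     # one full-width element per comparison
print(bin(compare_eq_mask(a, b)))  # one bit per comparison
```

The mask form has to gather one bit from every element of the vector into a single small register, which is one plausible reason for the longer latency.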

The Zen 2 has the impressive feature that it can mirror a memory operand in a temporary register inside the CPU so that it can be accessed without delay. See my description of this feature in another thread. This feature was removed in the Zen 3, but it has now come back in the Zen 4 with some improvement.

A new feature in Zen 4 is that a NOP (no operation) instruction can fuse with a preceding instruction in the decoder so that it uses no resources in the rest of the pipeline. A long sequence of NOPs will fuse together in pairs so that you get one micro-operation for every two NOPs. The throughput for a sequence of NOPs is 6 micro-operations, corresponding to 12 NOPs per clock cycle. This is an interesting feature, though the advantage for normal code is limited.
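The NOP numbers follow from simple counting. Here is a minimal Python sketch, assuming a pure run of NOPs that fuse in pairs and a limit of 6 micro-operations per clock cycle as stated above. The handling of an odd left-over NOP (it takes a micro-operation of its own) is an assumption, and the function names are made up.

```python
import math


def nop_micro_ops(num_nops):
    """Micro-operations for a run of NOPs that fuse together in pairs."""
    return math.ceil(num_nops / 2)


def nop_cycles(num_nops, uops_per_cycle=6):
    """Clock cycles to get a run of NOPs through at the given micro-op rate."""
    return math.ceil(nop_micro_ops(num_nops) / uops_per_cycle)


for n in (12, 120, 7):
    print(n, "NOPs:", nop_micro_ops(n), "micro-ops,", nop_cycles(n), "cycle(s)")
```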

The details of my test results are described in my microarchitecture manual. Instruction latencies and throughput are listed in my instruction tables.

Posted: **2022-11-05, 11:39:52**

RobertS wrote (2022-09-27, 5:53:29):

https://www.phoronix.com/review/amd-zen4-avx512

People in the forum thread discussing it mention "AVX-512 predication", for example:

That said, I've long been critical of AVX-512, or at least the aspect of it which involves widening vectors to 512-bit. Other things, like predication and scatter/gather, are indeed nice and maybe not hugely expensive in die area.

I know it's supposed to have some features, like predication and scatter/gather, which make it more efficient. It also doubles the number of vector registers, in addition to increasing their width.

What is this predication they're all talking about? I searched for this keyword in the AVX manuals and didn't find the answer.

Posted: **2022-11-05, 22:12:08**

Hi Agner. Thanks for all your work!

This is Alex, the author of the mersenneforum post. When I started testing the chip back in August, I initially suspected that 512-bit was split onto 2 x 256-bit pipes simultaneously on the same cycle. But after some further testing, I actually ruled that out.

There are 3 x 256-bit shuffle pipes for simple shuffles. (like VUNPCKLPS) So you can sustain 3 x 256-bit each cycle. But when testing the 512-bit version, the throughput is 1.5/cycle. If 512-bit was split into the different pipes on the same cycle, it should be 1.0/cycle as you wouldn't be able to use the 3rd pipe.
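The argument above compares two hypotheses, which can be written out as a small Python sketch. It assumes steady-state throughput with no other bottlenecks; the function names are hypothetical.

```python
from fractions import Fraction


def rate_if_halves_independent(pipes):
    """512-bit instructions per cycle if each 256-bit half can use any pipe
    on any cycle: every instruction simply costs two pipe slots."""
    return Fraction(pipes, 2)


def rate_if_halves_same_cycle(pipes):
    """512-bit instructions per cycle if both halves must occupy two
    different pipes on the same cycle: an odd pipe is left idle."""
    return Fraction(pipes // 2)


for pipes in (2, 3, 4):
    print(pipes, "pipes:",
          rate_if_halves_independent(pipes), "vs",
          rate_if_halves_same_cycle(pipes))
```

With an even number of pipes both hypotheses predict the same rate, so it is the three-pipe shuffle case that tells them apart: 1.5 per cycle versus 1.0 per cycle.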

When mixing 256-bit and 512-bit arithmetic (non-shuffle) instructions in close proximity, I get the full theoretical throughput. This implies either that 512-bit instructions can split onto different cycles, or that the reordering capability is able to fill holes created by lone 256-bit instructions. When I repeat this test with 512-bit all-to-all shuffles, I get less than full throughput, which (assuming the 512-bit shuffle is monolithic) suggests that the reordering capability is not fully able to reorder 256-bit and 512-bit instructions to fill every 256-bit hole.

Have you done similar tests?

Posted: **2022-11-06, 6:21:24**

Vladislav_152378 wrote:

What is predication they're all talking about?

It is the use of masks. A mask can selectively enable or disable individual vector elements.
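As an illustration, here is a minimal Python emulation of a masked vector addition. It is a sketch of the general idea only: integer wrap-around is ignored, and the function name and the `zeroing` parameter are made up to stand for the two masking modes of AVX512 (keep the old destination element, or set it to zero).

```python
def masked_add(dst, a, b, mask, zeroing=False):
    """Add a and b element by element where the mask bit is set.

    Where the mask bit is clear, the result keeps the old destination
    element (merge masking) or becomes zero (zero masking).
    """
    out = []
    for i, (old, x, y) in enumerate(zip(dst, a, b)):
        if (mask >> i) & 1:
            out.append(x + y)
        elif zeroing:
            out.append(0)
        else:
            out.append(old)
    return out


dst = [9, 9, 9, 9]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(masked_add(dst, a, b, 0b0101))                # merge masking
print(masked_add(dst, a, b, 0b0101, zeroing=True))  # zero masking
```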

Mysticial wrote:

If 512-bit was split into the different pipes on the same cycle, it should be 1.0/cycle as you wouldn't be able to use the 3rd pipe.

Yes, it can use all three 256-bit shuffle units when doing 512-bit shuffles. This means that the two halves of the 512-bit register are not necessarily executed simultaneously if only one 256-bit unit is vacant in the first clock cycle. This applies only to instructions that shuffle data within each of the four 128-bit lanes. Shuffle instructions that can exchange data between the four 128-bit lanes (what you call all-to-all shuffles, e.g. VPERMI2B) can only use two of the pipes (pipe 1 and 2). These two units can exchange data with each other. The throughput for such instructions is two 256-bit shuffles or one 512-bit shuffle per clock cycle. I assume that the two halves must execute simultaneously if they can exchange data with each other. The latency for such shuffles is longer: 3, 4, or 5 clock cycles for 128, 256, and 512 bits respectively. This indicates that the internal wiring geometry is optimized for the most common data flows, which is within 128-bit lanes, while data that must cross the 128-bit borders have longer paths to follow. This is the case for Intel processors as well.
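The distinction between the two kinds of shuffle can be shown with a small Python emulation. This is a sketch under simplifying assumptions: a 512-bit vector is modelled as 16 elements of 32 bits, so each 128-bit lane holds 4 elements, and the functions do not reproduce the exact index encoding of any particular instruction. The names are hypothetical.

```python
def shuffle_within_lanes(data, idx, lane_elems=4):
    """Each output element can only pick a source from its own lane."""
    out = []
    for i, j in enumerate(idx):
        lane_base = (i // lane_elems) * lane_elems
        out.append(data[lane_base + j % lane_elems])
    return out


def permute_across_lanes(data, idx):
    """Each output element can pick a source from anywhere in the vector."""
    return [data[j % len(data)] for j in idx]


data = list(range(16))
print(shuffle_within_lanes(data, [3, 2, 1, 0] * 4))         # reverse inside each lane
print(permute_across_lanes(data, list(range(15, -1, -1))))  # reverse the whole vector
```

In the first function no element ever leaves its lane, so the two 256-bit halves of a 512-bit shuffle never need data from each other. In the second, an output element in one half may need a source from the other half.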

Interestingly, instructions that generate a mask register result, e.g. VPCMPEQB, have the same latencies: 3, 4, or 5 clock cycles for 128, 256, and 512 bits respectively. This may indicate that mask results follow a similar route across the 128-bit lanes. Instructions with a mask, e.g. VPADDB zmm1{k1}, zmm2, zmm3, have no extra latency, though.

Posted: **2023-03-24, 1:54:20**

AMD has since released their optimization guide for Zen4. And it says:

The floating point unit supports AVX-512 with 512-bit macro ops and 512-bit storage in the register file. Because the data paths are 256 bits wide, the scheduler uses two consecutive cycles to issue a 512-bit operation. The peak FLOPS are the same for 512-bit operations compared to 256-bit operations, but there is reduced usage of scheduler, register file, and other queue entries. This gives the opportunity for greater performance and reduced power.

https://www.amd.com/en/support/tech-doc ... chitecture

This, and a related discussion here (https://www.realworldtech.com/forum/?th ... tid=209221), seem to disagree with your finding that 512-bit can be split and executed simultaneously on different EUs.

Thoughts? Or maybe we're missing something that you're seeing?

The feeling I get is that Zen4's implementation for 512-bit is very minimal. Which makes sense engineering-wise if the goal was to decently support AVX512 with minimal changes that don't require completely tearing up the design of Zen3 and starting from scratch.

The mental model I currently have is this:

- Zen4 is largely the same as Zen3. Same execution units (except the big shuffle). Same datapaths. High-level almost the same everywhere.
- 512-bit instructions are sent down the same 256-bit pipe on consecutive cycles.
- Split 512-bit instructions stay together (staggered) on consecutive cycles everywhere they go. (datapaths, EUs, register file read/writes)
- Despite one half being a cycle behind the other half, latencies of non-lane-crossing instructions stay the same because they never need to wait for each other.
- The 512-bit shuffle, when executing an instruction with cross-half dependencies, will wait a cycle so that both halves have arrived. When it's done it will write-back/forward to wherever it needs to go on consecutive cycles so everything stays together again.
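The latency consequences of this mental model can be sketched in a few lines of Python. This is a toy model of the points listed above, not a description of confirmed hardware behaviour: it assumes that the high half of every 512-bit value trails the low half by exactly one cycle, and the function name is made up.

```python
def chain_latency(num_ops, latency, cross_half):
    """Observed cycles per instruction in a dependency chain of 512-bit ops.

    Assumption: the high half of each operand arrives one cycle after
    the low half (staggered issue on consecutive cycles).
    """
    low, high = 0, 1  # cycle at which each half of the first input is ready
    for _ in range(num_ops):
        if cross_half:
            start = max(low, high)  # wait until both halves have arrived
            low = start + latency
            high = low + 1          # results leave on consecutive cycles again
        else:
            low += latency          # each half depends only on itself
            high += latency
    return low / num_ops


print(chain_latency(100, 1, cross_half=False))  # same as the 256-bit latency
print(chain_latency(100, 4, cross_half=True))   # one cycle more than the 256-bit latency
```

In this model an operation without cross-half dependencies keeps its 256-bit latency, while an operation that needs both halves pays one extra cycle. That is at least consistent with the 4 and 5 cycle latencies reported earlier in the thread for 256-bit and 512-bit lane-crossing shuffles, though the agreement does not prove the model.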

There are some design advantages of such an approach which make me think it may be closer to the actual design:

Keeping 512-bit together on consecutive cycles means you only need to track one uop. Splitting them into pairs of 256-bit to be scheduled into different EUs adds complexity as you need to track two things in different queues and merge them again later. This can also lead to bubbles if one half hits contention while the other does not. (As mentioned in an earlier post, I found no such bubbles unless it involved 512-bit shuffles with cross-half dependencies.) Basically, the scheduler doesn't need to change much. It just needs to know what instructions "occupy 2 cycles instead of 1" - everything else remains the same.

Widening the register file to 512-bit probably isn't that complicated. You only need to double up the storage gates and mux the halves so you can access each half on different cycles. The datapaths stay 256-bit so there's no need to draw additional traces into the EUs. By comparison, Intel's implementation has to have the full 512-bit datapaths to the register file since it can read/write the entire register in one cycle.

AMD Zen 4 Ryzen 7000 series with AVX-512 support
