AMD Zen 4 Ryzen 7000 series with AVX-512 support
Posted: 2022-08-02, 20:53:36
If I have understood correctly, AVX-512 support is coming to relatively cheap AMD processors: more specifically, the Zen 4 Ryzen 7000 series.
News and research about CPU microarchitecture and software optimization
https://agner.org/forum/
"it would only split AVX 512 into half and double pump it through the AVX 256 vector engine to save power"
This was expected. It saves silicon space. It might still give a performance benefit in the decoding stage.
People in the forum thread discussing it mention "AVX-512 predication", for example:
That said, I've long been critical of AVX-512, or at least the aspect of it which involves widening vectors to 512-bit. Other things, like predication and scatter/gather, are indeed nice and maybe not hugely expensive in die area.
"I know it's supposed to have some features, like predication and scatter/gather, which make it more efficient. It also doubles the number of vector registers, in addition to increasing their width."
What is the predication they're all talking about? I searched for this keyword in the AVX manuals and didn't find the answer.
"What is predication they're all talking about?"
It is the use of masks. A mask can selectively enable or disable individual vector elements.
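To make the mask idea concrete, here is a minimal scalar sketch in Python. It is only a model of the semantics, not real vector code: the function name `masked_add` and the 8-element vectors are made up for illustration. It assumes the two masking flavors AVX-512 offers, where a disabled element either keeps its old destination value (merge-masking) or is set to zero (zero-masking).

```python
def masked_add(src, mask, a, b, zeroing=False):
    """Scalar model of a masked vector add (hypothetical helper, not a real API).

    Bit i of mask set:   element i becomes a[i] + b[i].
    Bit i of mask clear: element i keeps src[i] (merge-masking),
                         or becomes 0 when zeroing=True (zero-masking).
    """
    result = []
    for i in range(len(a)):
        if (mask >> i) & 1:
            result.append(a[i] + b[i])
        elif zeroing:
            result.append(0)
        else:
            result.append(src[i])
    return result


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10] * 8
src = [-1] * 8        # previous contents of the destination register
mask = 0b00001111     # enable elements 0..3, disable elements 4..7

merged = masked_add(src, mask, a, b)
zeroed = masked_add(src, mask, a, b, zeroing=True)

print(merged)  # [11, 12, 13, 14, -1, -1, -1, -1]
print(zeroed)  # [11, 12, 13, 14, 0, 0, 0, 0]
```

In C intrinsics the corresponding real operations would be along the lines of `_mm512_mask_add_ps` (merge) and `_mm512_maskz_add_ps` (zero), where the mask is a 16-bit value for sixteen 32-bit floats.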
"If 512-bit was split into the different pipes on the same cycle, it should be 1.0/cycle as you wouldn't be able to use the 3rd pipe."
Yes, it can use all three 256-bit shuffle units when doing 512-bit shuffles. This means that the two halves of the 512-bit register are not necessarily executed simultaneously if only one 256-bit unit is vacant in the first clock cycle. This applies only to instructions that shuffle data within each of the four 128-bit lanes.
Shuffle instructions that can exchange data between the four 128-bit lanes (what you call all-to-all shuffles, e.g. VPERMI2B) can only use two of the pipes (pipe 1 and 2). These two units can exchange data with each other. The throughput for such instructions is two 256-bit shuffles or one 512-bit shuffle per clock cycle. I assume that the two halves must execute simultaneously if they can exchange data with each other. The latency for such shuffles is longer: 3, 4, or 5 clock cycles for 128, 256, and 512 bits respectively.
This indicates that the internal wiring geometry is optimized for the most common data flows, which are within 128-bit lanes, while data that must cross the 128-bit borders have longer paths to follow. This is the case for Intel processors as well.
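The difference between the two kinds of shuffle can be sketched with a scalar Python model. This is an illustration under assumptions, not a description of the hardware: the function names are hypothetical, the in-lane case is a simplified byte shuffle in the style of VPSHUFB (ignoring its zeroing bit), and the cross-lane case is a single-source full byte permute, which is simpler than the two-source VPERMI2B named above.

```python
LANE_BYTES = 16     # one 128-bit lane
VECTOR_BYTES = 64   # one 512-bit register = four lanes


def shuffle_in_lane(src, idx):
    """Each output byte may only pick a byte from its own 128-bit lane."""
    out = []
    for i in range(VECTOR_BYTES):
        lane_base = (i // LANE_BYTES) * LANE_BYTES
        out.append(src[lane_base + (idx[i] % LANE_BYTES)])
    return out


def permute_cross_lane(src, idx):
    """Each output byte may pick any byte of the whole 512-bit register."""
    return [src[idx[i] % VECTOR_BYTES] for i in range(VECTOR_BYTES)]


src = list(range(VECTOR_BYTES))                          # byte i holds value i
reverse = [VECTOR_BYTES - 1 - i for i in range(VECTOR_BYTES)]

in_lane = shuffle_in_lane(src, reverse)
cross = permute_cross_lane(src, reverse)

print(in_lane[:16])  # [15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(cross[:4])     # [63, 62, 61, 60]
```

With the same index vector, the in-lane shuffle can only reverse each 16-byte lane in place, while the cross-lane permute reverses the whole register. Only the second kind needs data paths across the 128-bit borders, which fits the longer latency reported above.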
https://www.amd.com/en/support/tech-doc ... chitecture
"The floating point unit supports AVX-512 with 512-bit macro ops and 512-bit storage in the register file. Because the data paths are 256 bits wide, the scheduler uses two consecutive cycles to issue a 512-bit operation. The peak FLOPS are the same for 512-bit operations compared to 256-bit operations, but there is reduced usage of scheduler, register file, and other queue entries. This gives the opportunity for greater performance and reduced power."
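The arithmetic behind "same peak FLOPS, fewer macro ops" can be worked through in a few lines of Python. This is a back-of-the-envelope sketch for a single pipe, assuming only the 256-bit data path width and the two-cycle issue stated in the AMD text. The function names and the 1024-element workload are invented for illustration, and the model ignores the number of pipes, FMA, and everything else.

```python
DATAPATH_BITS = 256  # data path width, from the AMD text quoted above


def issue_cycles(vector_bits):
    """Cycles needed to issue one operation of this width (ceiling division)."""
    return -(-vector_bits // DATAPATH_BITS)


def elements_per_cycle(vector_bits, element_bits=32):
    """Peak elements processed per cycle on one pipe."""
    return (vector_bits // element_bits) / issue_cycles(vector_bits)


def macro_ops_needed(total_elements, vector_bits, element_bits=32):
    """Macro ops needed to cover a workload of total_elements elements."""
    per_op = vector_bits // element_bits
    return -(-total_elements // per_op)


print(elements_per_cycle(256))      # 8.0
print(elements_per_cycle(512))      # 8.0
print(macro_ops_needed(1024, 256))  # 128
print(macro_ops_needed(1024, 512))  # 64
```

Peak throughput per cycle comes out identical for both widths, but the 512-bit version needs half as many macro ops, so it occupies fewer scheduler, register file, and queue entries for the same work.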