Lefty wrote:
Thanks for the answer.
I have another question. I am wondering why AVX-256 /
AVX-512 is considered superior to AVX-128.
You can pack 2 AVX-128 instructions into one AVX-256
instruction (provided that the instructions are
independent), but it will not necessarily execute
faster. A CPU with one 256 bit SIMD unit can execute
the AVX-256 instruction in one cycle, however a CPU
with 2 128-bit SIMD units would just schedule the 2
AVX-128 instructions to execute simultaneously - also
in one cycle. I don't see where the advantage is.
In general there might not be an advantage if you are comparing a CPU that offers 2N execution units of width W versus one which offers N execution units of width 2W (e.g., 4 x 128-bit units versus 2 x 256-bit units) - but that's not usually the comparison you see in actual hardware. In general it is much easier to double the width of the vector units than it is to sustainably execute at double the IPC. Indeed, Intel chips have been "stuck" at 4-wide issue for nearly a decade, despite the vector size growing from 128 bits to 512 bits over the same period.

To double sustained IPC (in code that can supply the necessary ILP in the first place) you'd have to approximately double fetch, decode, rename and retire throughput, and increase the size of many structures such as the ROB and PRF. Even then you might run out of registers in the ISA, since you effectively need twice as many architectural registers to keep the same amount of data "in flight". Many of these changes aren't just linear increases in hardware complexity, but quadratic or worse - and at some point they aren't even possible without reducing the clock frequency.

Increasing the width of the SIMD units, on the other hand, is generally a straightforward linear increase in complexity (with the exception of some lane-crossing operations, which is why those often have longer latency and are generally discouraged).