The Russian review at https://3dnews.ru/954174, as usual, has more thorough low-level benchmarks than Anand. One test in particular is important: https://3dnews.ru/assets/external/illustrations/2017/06/19/954174/avx-512.png

As we can see there, FP computations got an almost 2x speedup, while INT got only a 20-40% improvement. I think that last result lines up perfectly with my prediction: port 5 was extended to 512 bits, so bit shuffling becomes 2x faster and the PADD group gets a 33% boost. I expected a 10-20% overall speedup, but the new AVX512 features (new instructions, built-in masking) probably improved performance further.

My last prediction was: "also it's easy to predict that in the next generations the first "improvement" will be to add FMAD capability to port 5, further doubling the marketing performance figures". I didn't expect it in the Skylake generation because of the excessive TDP increase (as we know, even using AVX2 on previous generations increased TDP by 40%, so two full-featured AVX512 ports should *further* increase TDP by 80%!). Nevertheless, they did exactly that, and got the entirely expected TDP problems. Note that from the 3dnews test we can conclude that port 5 gained only an FMA engine and no other AVX512 commands (apart from the mere widening of the AVX2 commands already on this port).

So I can say that my speculation turned out to be 200% right :)
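The 2x and 33% figures above can be checked with simple peak-width arithmetic. The sketch below is only an illustration under an assumed port model: in AVX2 mode, shuffles issue on port 5 only, PADD on ports 0, 1 and 5, and FMA on ports 0 and 1, each 256 bits wide; in AVX512 mode, ports 0 and 1 fuse into one 512-bit unit and port 5 is 512 bits wide with its own FMA engine. The dictionary layout and the `speedup` helper are hypothetical names made up for this example, and the numbers ignore clock changes, memory bandwidth and everything else a real benchmark measures.

```python
# Peak vector bits issued per cycle for each instruction group.
# ASSUMPTION: the port assignments below are a simplified model, not measured data.
AVX2 = {
    "shuffle": [256],            # port 5 only
    "padd":    [256, 256, 256],  # ports 0, 1 and 5
    "fma":     [256, 256],       # ports 0 and 1
}
AVX512 = {
    "shuffle": [512],            # port 5 widened to 512 bits
    "padd":    [512, 512],       # fused ports 0+1, plus port 5
    "fma":     [512, 512],       # fused ports 0+1, plus the added port 5 FMA engine
}


def speedup(group):
    """Ratio of peak bits per cycle in AVX512 mode to AVX2 mode."""
    return sum(AVX512[group]) / sum(AVX2[group])


for group in AVX2:
    print(f"{group}: {speedup(group):.2f}x")
```

Under this model the PADD group goes from 768 to 1024 bits per cycle, which is where the 33% comes from, while shuffles and FMA both double.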
But pulling together everything we know, it seems that from a technical point of view Skylake is a total mess! The SKL architecture I predicted was a compromise: it added as little hardware unused in AVX256 mode as possible, but still had AVX512 support. It was a great step toward future processors: add 512-bit support for forward compatibility, but don't invest heavily in AVX512-only hardware until more 512-bit programs arrive. To reach this goal, they made some changes that were bad for AVX2 programs (see my second post).
But once they added the second FMA512 engine, this became meaningless. Now we have a design that both limits AVX2 performance and has a lot of hardware unused in AVX2 mode! By simply widening the Haswell engines 2x, we could have gotten a slightly higher transistor count and much better AVX512 performance.

I think this is the result of marketing games: SKL-S already had AVX512 support (without the second FMA engine, though), but they decided to disable it on all SKUs. The newer SKL-X added the second engine but enabled it only on selected SKUs, so the i7 provides exactly the architecture I predicted (and it was probably their Plan B: use SKL-S cores with a single FMA engine for HEDT/Xeon products).

Now we can also see why SKL-S reduced L2$ associativity to 4. It was preparation for increasing the cache size: the SKL-S cache is just a quarter of the SKL-X cache with the same organization, and the reduced associativity cut the transistor budget of the massive 1MB cache. This is a sign that SKL-X is a much smaller modification of the SKL-S core than it might seem at first sight.
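To make the "quarter of the cache with the same organization" point concrete, here is a small sketch of the set arithmetic. It assumes figures that are not stated in the post itself: 64-byte cache lines, a 256 KB 4-way L2 on SKL-S, and a 16-way 1 MB L2 on SKL-X. The `l2_sets` helper is a hypothetical name used only for this illustration.

```python
def l2_sets(size_bytes, ways, line_bytes=64):
    """Number of sets in a set-associative cache of the given geometry."""
    return size_bytes // (ways * line_bytes)


# ASSUMPTION: 256 KB / 4-way for SKL-S, 1 MB / 16-way for SKL-X, 64-byte lines.
skl_s_sets = l2_sets(256 * 1024, ways=4)
skl_x_sets = l2_sets(1024 * 1024, ways=16)

print("SKL-S sets:", skl_s_sets)
print("SKL-X sets:", skl_x_sets)
print("same set count:", skl_s_sets == skl_x_sets)
```

If those figures hold, both caches have the same number of sets, so the set indexing stays the same and the larger cache simply has four times as many ways per set.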