Agner's CPU blog

Future instruction set: AVX-512
Author: Agner Date: 2013-10-09 09:36

Intel have announced the next big instruction set extension, AVX512, to be implemented in 2015 or 2016. The details are defined in Intel Architecture Instruction Set Extensions Programming Reference. There are many interesting extensions:

  • The size of the vector registers is extended from 256 bits (YMM registers) to 512 bits (ZMM registers). There is room for further extensions to at least 1024 bits (what will they be called?)
  • The number of vector registers is doubled to 32 registers in 64-bit mode. There will still be only 8 vector registers in 32-bit mode.
  • Eight new mask registers k0 - k7 allow masked and conditional operations. Most vector instructions can be masked so that they only operate on selected vector elements while the remaining vector elements are unchanged or zeroed. This will replace the use of vector registers as masks.
  • Most vector instructions with a memory operand have an option for broadcasting a scalar operand.
  • Floating point vector instructions have options for specifying the rounding mode and for suppressing exceptions.
  • There is a new addressing mode called compressed displacement. Where instructions have a memory operand with a pointer and an 8-bit sign-extended displacement, the displacement is multiplied by the size of the operand. This makes it possible to address a larger interval with just a single byte displacement as long as the memory operands are properly aligned. The instructions become smaller in some cases, which compensates for the longer prefix.
  • More than 100 new instructions.
  • The 512-bit registers can do vector operations on 32-bit and 64-bit signed and unsigned integers and single and double precision floats, but unfortunately not on 8-bit and 16-bit integers.
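
To make the difference between merging and zeroing concrete, here is a minimal scalar sketch in C of a masked 32-bit integer addition. It is only a model of the masking behaviour described in the list above, not real AVX512 code. The function name masked_add_epi32 and its parameters are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a masked 32-bit integer add (hypothetical helper,
   not an intrinsic). A 512-bit register holds 16 such elements, so
   the mask needs at most 16 bits.
   mask bit i set:   element i receives a[i] + b[i].
   mask bit i clear: element i keeps its old dst value (merging),
                     or becomes zero if zeroing is nonzero. */
void masked_add_epi32(int32_t *dst, const int32_t *a, const int32_t *b,
                      uint16_t mask, int zeroing, int lanes)
{
    assert(lanes >= 0 && lanes <= 16);
    for (int i = 0; i < lanes; i++) {
        if ((mask >> i) & 1) {
            /* add as unsigned so that overflow wraps around */
            dst[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
        } else if (zeroing) {
            dst[i] = 0;
        }
        /* merging: unselected elements are left untouched */
    }
}
```

With mask 0x0005 and zeroing = 0, only elements 0 and 2 are written and all other elements of dst keep their previous values. With zeroing = 1, the unselected elements become zero instead.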

A year ago, Intel announced a similar instruction set with 512-bit registers in Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual. The two instruction sets are very similar and both are backwards compatible, but they are not compatible with each other. They differ by a single prefix bit, even for otherwise identical instructions. I assume that the Knights Corner or Xeon Phi instruction set will have a short life and be replaced by AVX512.

The AVX512 instruction set uses a new 4-byte prefix named EVEX, which is similar to the 2- or 3-byte VEX prefix, but with 62 (hexadecimal) as the first byte. (Actually, I predicted several years ago that the 62 byte would be used for such a prefix because it was the only remaining byte that could be used in the same way as the VEX prefix bytes.) The extra bits in the EVEX prefix are used for doubling the number of registers, for specifying vector size, and for the extra features of broadcasting, masking, zeroing, specifying rounding mode, and suppressing floating point exceptions.
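
Since the compressed displacement is meant to compensate for the longer EVEX prefix, the arithmetic behind it can be sketched in a few lines of C. This is only an illustration under the assumption that the scale factor is the size of the memory operand in bytes (64 for a full 512-bit operand). The function names are hypothetical and not taken from any manual or tool.

```c
#include <assert.h>
#include <stdint.h>

/* Effective byte displacement: the stored signed byte is
   multiplied by the operand size in bytes. */
int32_t compressed_disp(int8_t disp8, int32_t operand_bytes)
{
    assert(operand_bytes > 0);
    return (int32_t)disp8 * operand_bytes;
}

/* Returns 1 if the byte offset can be encoded in a single
   displacement byte for this operand size, and stores that
   byte in *disp8. Returns 0 if a longer displacement would
   be needed. */
int fits_compressed_disp8(int32_t offset, int32_t operand_bytes,
                          int8_t *disp8)
{
    assert(operand_bytes > 0);
    if (offset % operand_bytes != 0)
        return 0;                /* not a multiple of the operand size */
    int32_t q = offset / operand_bytes;
    if (q < -128 || q > 127)
        return 0;                /* scaled value does not fit in 8 bits */
    *disp8 = (int8_t)q;
    return 1;
}
```

Under this assumption, a 64-byte operand can be reached at offsets from -8192 to +8128 in steps of 64 with a single displacement byte, whereas an unscaled byte only covers -128 to +127.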

The calling conventions for the new registers are partially defined in a draft ABI, but it is still being discussed whether the new registers should have callee-save status; see the GNU libc-alpha mailing list.

I have commented on the AVX512 instruction set and suggested various improvements at Intel's blog and Intel's forum.

The new instruction sets are supported by my objconv disassembler.

Future instruction set: AVX-512
Author: Elhardt Date: 2013-10-25 15:57
Hello Agner. You've mentioned most of the important improvements that AVX512 will bring us. However, you've missed an important one that I think should also have been mentioned. AVX512 will include reciprocal estimates that are accurate to 2 ^ -28. That means that for single precision floating point, no time-consuming Newton-Raphson refinement needs to be done. This can be a major speed boost for division (square roots get the more accurate estimate too). Intel's divisions have gotten a lot faster over the years, to the point where they appear to be faster than the reciprocal / Newton-Raphson method. But now it looks like using the new reciprocal estimate is a way to leap ahead of the divide instructions and gain more speed again.
Future instruction set: AVX-512
Author: Agner Date: 2013-10-26 01:47
Elhardt wrote:
AVX512 will include reciprocal estimates that are accurate to 2 ^ -28.

AVX512 will have instructions for calculating reciprocals and reciprocal square roots with a precision of 2^-14. A subsequent extension, AVX512ER, will have reciprocals and reciprocal square roots with a precision of 2^-28 and an exponential function with a precision of 2^-23.
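
To relate these precision figures to the Newton-Raphson refinement mentioned above: one Newton-Raphson step roughly squares the relative error of a reciprocal estimate, so an estimate with error 2^-14 ends up near 2^-28 after a single step. The following C sketch illustrates the arithmetic only. The function approx_recip is a made-up stand-in that perturbs an exact reciprocal by a chosen relative error; it does not model the actual hardware instructions, and everything is computed in double precision to keep rounding out of the picture.

```c
#include <assert.h>

/* Hypothetical stand-in for a hardware reciprocal estimate:
   the exact reciprocal, deliberately off by rel_err. */
double approx_recip(double a, double rel_err)
{
    assert(a != 0.0);
    return (1.0 / a) * (1.0 + rel_err);
}

/* One Newton-Raphson step for the reciprocal of a:
   x1 = x0 * (2 - a * x0).
   If x0 = (1 + e) / a, then x1 = (1 - e*e) / a. */
double nr_recip_step(double a, double x0)
{
    return x0 * (2.0 - a * x0);
}

/* Relative error of x as an approximation to 1/a. */
double recip_rel_error(double a, double x)
{
    double d = a * x - 1.0;
    return d < 0.0 ? -d : d;
}
```

Starting from approx_recip(a, 1.0 / 16384.0), which has a relative error of 2^-14, a single call to nr_recip_step brings the relative error down to about 2^-28. That is the precision the AVX512ER estimate is specified to deliver directly.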