Agner's CPU blog


 
 
Future instruction set: AVX-512
Author: Agner Date: 2013-10-09 09:36

Intel have announced the next big instruction set extension, AVX512, to be implemented in 2015 or 2016. The details are defined in the Intel Architecture Instruction Set Extensions Programming Reference. There are many interesting extensions:

  • The size of the vector registers is extended from 256 bits (YMM registers) to 512 bits (ZMM registers). There is room for further extensions to at least 1024 bits (what will they be called?)
  • The number of vector registers is doubled to 32 registers in 64-bit mode. There will still be only 8 vector registers in 32-bit mode.
  • Eight new mask registers k0 - k7 allow masked and conditional operations. Most vector instructions can be masked so that they operate only on selected vector elements while the remaining vector elements are left unchanged or zeroed. This will replace the use of vector registers as masks.
  • Most vector instructions with a memory operand have an option for broadcasting a scalar operand.
  • Floating point vector instructions have options for specifying the rounding mode and for suppressing exceptions.
  • There is a new addressing mode called compressed displacement. Where instructions have a memory operand with a pointer and an 8-bit sign-extended displacement, the displacement is multiplied by the size of the operand. This makes it possible to address a larger interval with just a single byte displacement as long as the memory operands are properly aligned. This makes the instructions smaller in some cases to compensate for the longer prefix.
  • More than 100 new instructions
  • The 512-bit registers can do vector operations on 32-bit and 64-bit signed and unsigned integers and single and double precision floats, but unfortunately not on 8-bit and 16-bit integers.
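
The two masking modes in the list above (unselected elements unchanged or zeroed) can be illustrated with a small scalar emulation in plain C. This is only a sketch of the semantics, not real AVX512 code: the function name masked_add_epi32 and the zeroing flag are made up for this illustration, and a 16-bit integer stands in for a mask register covering 16 elements of 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16  /* a 512-bit vector holds 16 elements of 32 bits */

/* Hypothetical helper: emulates a masked 32-bit add, element by element.
   Bit i of the mask k selects element i. Unselected elements either keep
   the old destination value (zeroing == 0) or are set to zero
   (zeroing != 0). */
void masked_add_epi32(int32_t dst[LANES], const int32_t a[LANES],
                      const int32_t b[LANES], uint16_t k, int zeroing)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (int i = 0; i < LANES; i++) {
        if ((k >> i) & 1) {
            /* add in unsigned arithmetic so that overflow wraps around */
            dst[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
        } else if (zeroing) {
            dst[i] = 0;
        }
        /* otherwise dst[i] is left unchanged */
    }
}
```

Calling it with k = 0x0005 and zeroing = 0 updates only elements 0 and 2 and leaves the other 14 elements of dst as they were. The same call with zeroing = 1 clears those 14 elements instead.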

A year ago, Intel announced a similar instruction set with 512-bit registers in the Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual. The two instruction sets are very similar, and both are backwards compatible, but they are not compatible with each other. The two instruction sets differ by a single prefix bit, even for otherwise identical instructions. I assume that the Knights Corner or Xeon Phi instruction set will have a short life and be replaced by AVX512.

The AVX512 instruction set uses a new 4-byte prefix named EVEX, which is similar to the 2- or 3-byte VEX prefix, but with 62 (hexadecimal) as the first byte. (Actually, I predicted several years ago that the 62 byte would be used for such a prefix because it was the only remaining byte that could be used in the same way as the VEX prefix bytes). The extra bits in the EVEX prefix are used for doubling the number of registers, for specifying vector size, and for the extra features of broadcasting, masking, zeroing, specifying rounding mode, and suppressing floating point exceptions.
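
The trade-off between the longer EVEX prefix and the compressed displacement from the feature list can be sketched with some byte counting. The code below is a simplified model, not an encoder. The function names are made up for this illustration. The scale factor n is assumed to be 64 for a full 512-bit memory operand and 1 for the classic unscaled 8-bit displacement. A zero offset is assumed to need no displacement byte, which ignores base registers that always require one.

```c
#include <assert.h>

/* Hypothetical helper: number of displacement bytes needed for the byte
   offset disp when an 8-bit displacement is multiplied by n.
   n = 1 models the classic unscaled displacement,
   n = 64 models a full 512-bit memory operand. */
int disp_bytes(long disp, long n)
{
    assert(n > 0);
    if (disp == 0) {
        return 0;  /* assumption: no displacement byte needed */
    }
    if (disp % n == 0 && disp / n >= -128 && disp / n <= 127) {
        return 1;  /* fits in a single sign-extended byte */
    }
    return 4;      /* fall back to a 32-bit displacement */
}

/* Prefix plus displacement bytes.
   prefix_len is 2 or 3 for VEX and 4 for EVEX. */
int prefix_and_disp_bytes(int prefix_len, long disp, long n)
{
    return prefix_len + disp_bytes(disp, n);
}
```

Under these assumptions, an offset of 256 bytes costs 3 + 4 = 7 bytes with a 3-byte VEX prefix and only 4 + 1 = 5 bytes with EVEX, while an offset of 64 bytes costs 3 + 1 = 4 bytes with VEX and 4 + 1 = 5 bytes with EVEX. The compressed displacement therefore pays off only for offsets outside the -128 to +127 range, and only when the offset is a multiple of the operand size.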

The calling conventions for the new registers are partially defined in a draft ABI, but it is still being discussed whether the new registers should have callee-save status; see the GNU libc-alpha mailing list.

I have commented on the AVX512 instruction set and suggested various improvements at Intel's blog and Intel's forum.

The new instruction sets are supported by my objconv disassembler.

   
Future instruction set: AVX-512
Author: Elhardt Date: 2013-10-25 15:57
Hello Agner. You've mentioned most of the important improvements that AVX512 will bring us. However, you've missed an important one that I think should also have been mentioned. AVX512 will include reciprocal estimates that are accurate to 2 ^ -28. That means that for single precision floating point, no time-consuming Newton-Raphson refinement needs to be done. This can be a major speed boost for division (square roots get the more accurate estimate too). Intel's divisions have gotten a lot faster over the years, to the point where they appear to be faster than the reciprocal / Newton-Raphson method. But now it looks like using the new reciprocal estimate is a way to leap ahead of the divide instructions and gain more speed again.
   
Future instruction set: AVX-512
Author: Agner Date: 2013-10-26 01:47
Elhardt wrote:
AVX512 will include reciprocal estimates that are accurate to 2 ^ -28.

AVX512 will have instructions for calculating reciprocals and reciprocal square roots with a precision of 2^-14. A subsequent extension, AVX512ER, has reciprocals and reciprocal square roots with a precision of 2^-28 and an exponential function with a precision of 2^-23.
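
The relation between the two precisions can be checked with a little arithmetic: one Newton-Raphson step roughly squares the relative error, so an estimate that is accurate to 2^-14 becomes accurate to about 2^-28 after one step, before rounding errors are counted. The sketch below shows this in plain C with double precision. It does not use the hardware instructions, the starting estimate has to be simulated by the caller, and the function names are made up for this illustration.

```c
#include <assert.h>

/* One Newton-Raphson step for the reciprocal of d:
       x1 = x0 * (2 - d * x0)
   If x0 = (1/d) * (1 + e) then x1 = (1/d) * (1 - e*e),
   so the relative error goes from e to e*e. */
double refine_reciprocal(double d, double x0)
{
    assert(d != 0.0);
    return x0 * (2.0 - d * x0);
}

/* Relative error of x as an approximation to 1/d. */
double reciprocal_rel_error(double d, double x)
{
    double e = x * d - 1.0;
    return e < 0.0 ? -e : e;
}
```

For example, with d = 3 and a simulated estimate x0 = (1/3) * (1 + 2^-14), the relative error is close to 2^-14 before the step and close to 2^-28 after it.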

   
Future instruction set: AVX-512
Author: Agner Date: 2014-10-08 11:02
Agner wrote:
The 512-bit registers can do vector operations on 32-bit and 64-bit signed and unsigned integers and single and double precision floats, but unfortunately not on 8-bit and 16-bit integers.
The latest update of Intel's manual specifies a future instruction set named AVX512BW which has vectors of 32 16-bit integers or 64 8-bit integers. See software.intel.com/en-us/intel-isa-extensions.

The AVX512 instruction set will be divided into several subsets: AVX512BW for vector instructions with 8-bit (Byte) and 16-bit (Word) granularity; AVX512DQ for 32-bit (Dword or float) and 64-bit (Qword or double) granularity; AVX512VL for the same instructions with 128-bit and 256-bit total vector length; and various other subsets.

The Skylake processor, planned for 2015, will probably support all these subsets, while the Knights Landing multiprocessor will not support the BW subset, according to this announcement: software.intel.com/en-us/blogs/additional-avx-512-instructions.

A 512-bit vector with 8-bit granularity will have 64 elements and require 64-bit mask registers. The mask registers are officially 64-bit architectural registers, according to the manual. It is not clear what "architectural" means here, but it usually means something that is guaranteed to be supported in future processors. This raises the question of whether future extensions are possible. If future extensions to 1024-bit or 2048-bit vectors support 8-bit and 16-bit granularity, then the mask registers must be bigger, so that they can no longer communicate nicely with the 64-bit general purpose registers. If there are future extensions of the vector size at all, either they will have only 32-bit and 64-bit granularity, or the mask registers will have to be redesigned.
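
The mask size argument is simple arithmetic: a mask needs one bit per vector element, that is, the vector size divided by the element size. A minimal sketch in C, with function names made up for this illustration:

```c
#include <assert.h>

/* Number of mask bits needed: one bit per vector element. */
unsigned mask_bits_needed(unsigned vector_bits, unsigned element_bits)
{
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    return vector_bits / element_bits;
}

/* Nonzero if such a mask fits in one 64-bit general purpose register. */
int mask_fits_in_gpr64(unsigned vector_bits, unsigned element_bits)
{
    return mask_bits_needed(vector_bits, element_bits) <= 64;
}
```

A 512-bit vector of 8-bit elements needs exactly 64 mask bits. A 1024-bit vector of 8-bit elements would need 128 bits, and so would a 2048-bit vector of 16-bit elements, so neither mask would fit in a 64-bit register. With 32-bit elements, even a 2048-bit vector needs only 64 mask bits.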