Agner's CPU blog

Test results for Knights Landing - Agner - 2016-11-26

Test results for Knights Landing - Nathan Kurz - 2016-11-26

Test results for Knights Landing - Tom Forsyth - 2016-11-27

Test results for Knights Landing - Søren Egmose - 2016-11-27

Test results for Knights Landing - Agner - 2016-11-30

Test results for Knights Landing - Joe Duarte - 2016-12-03

Test results for Knights Landing - Agner - 2016-12-04

Test results for Knights Landing - Constantinos Evangelinos - 2016-12-05

Test results for Knights Landing - John McCalpin - 2016-12-06

Test results for Knights Landing - Agner - 2016-12-06

Test results for Knights Landing - John McCalpin - 2016-12-08

Test results for Knights Landing - Joe Duarte - 2016-12-07

Test results for Knights Landing - zboson - 2016-12-28

VZEROUPPER - Agner - 2016-12-28

Test results for Knights Landing - Ioan Hadade - 2017-07-13

Test results for Knights Landing - Agner - 2017-07-13

INC/DEC throughput - Peter Cordes - 2017-10-09

INC/DEC throughput - Agner - 2017-10-10

Test results for Knights Landing

Author: Agner

Date: 2016-11-26 08:39

The Knights Landing is Intel's new "Many Integrated Core" processor. It has 64-72 cores that can run four threads each. It is built with a 14 nm process and runs at a clock frequency of 1.3-1.5 GHz. It is intended for processing large data sets in parallel. It is only useful, of course, if the calculations can easily be split up into multiple threads that can run in parallel.

Each core is lightweight, based on an extension of the Silvermont low power architecture. Each core runs slower than a desktop CPU, but with a large number of cores we can still get a high overall performance. It has 32 kB of level-1 code cache and 32 kB of level-1 data cache per core; 1 MB of level-2 cache shared between each pair of cores; and 16 GB of MCDRAM inside the package. The MCDRAM can be configured as a level-3 cache or as main memory.

The predecessor, Knights Corner, was not very impressive and it had its own instruction set. The Knights Landing is the first processor with the new AVX512 instruction set. It is expected that AVX512 will be the standard for future x86 processors so that the Knights Landing will be binary compatible with mainstream microprocessors. It also supports the previous instruction sets AVX2, etc.

The AVX512 instruction set seems to be quite efficient. It has 32 vector registers of 512 bits each, where AVX2 has only 16 registers of 256 bits each. It also has a new set of eight mask registers that can be used for conditional execution of each element of a vector. Almost all vector instructions can be masked. This works quite efficiently. The latencies and throughputs of vector instructions are the same with or without a mask, and independent of the value of the mask register. The GNU compiler optimizes this quite well, so that, for example, an addition and an if statement can be merged into a single add instruction with a mask.
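As a rough illustration of the "addition and an if" pattern mentioned above, here is a minimal C sketch. The function name add_where_positive is hypothetical and not taken from the original post. Whether a given compiler actually emits a single masked add for this loop is an assumption that depends on the compiler version and on options such as -O3 with -mavx512f; the generated assembly should be inspected to confirm it.

```c
#include <assert.h>
#include <stddef.h>

/* Adds b[i] to a[i] only where b[i] is positive; all other elements of a
   are left unchanged. The condition is written as a select so that every
   element is stored unconditionally, which makes it easy for a vectorizing
   compiler to map the comparison to a mask register and the addition to a
   masked vector add. That mapping is an assumption, not a guarantee. */
void add_where_positive(float *restrict a, const float *restrict b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] = (b[i] > 0.0f) ? a[i] + b[i] : a[i];
    }
}
```

The scalar source contains no intrinsics, so it compiles and gives the same results on any processor; only the generated machine code differs between targets.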

On the positive side, the Knights Landing has true out-of-order processing (unlike Knights Corner and Silvermont). It has a good memory throughput. It can do two 512-bit vector reads, or one read and one write, per clock cycle. The throughput for simple vector instructions is two 512-bit vectors per clock.

The Knights Landing has an instruction set extension, AVX512ER, with some quite impressive math instructions. It can calculate a reciprocal, a reciprocal square root, and even an exponential function, on a vector of 16 floats in just a few clock cycles. The manual has a somewhat confusing description of the accuracy of these instructions. My measurements showed that these instructions are accurate to the last bit on single precision floats, while they give only approximate results for double precision. These instructions are useful for neural networks and other large low-precision math applications.

On the negative side, all vector instructions have a latency of at least 2 clock cycles, where earlier processors have a latency of 1 for simple vector instructions. Integer instructions on general purpose registers have a latency of 1. A possible explanation for this difference is that the integer reservation station can hold source data, while the floating point reservation station cannot. This means that an integer ALU can write its result directly to any subsequent micro-op in the reservation station that needs it, while results in the floating point unit have to go via the floating point register file. The size of the vector operands is simply too large to make it practical to store the values in the reservation station.
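One practical consequence of the 2-cycle latency is that a single dependency chain cannot keep the vector units busy. Taking the figures given in this post (latency 2 clocks, throughput two simple vector instructions per clock), roughly 2 x 2 = 4 independent chains would be needed to reach full throughput. This number is an inference from those figures, not a measured result. The sketch below shows the general technique of splitting a sum into four independent accumulators. It is written in scalar C to stay portable; in real vector code each accumulator would be a vector register. The function name sum_four_chains is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n floats using four independent accumulators, so that four
   additions can be in flight at the same time instead of each addition
   waiting for the result of the previous one. */
float sum_four_chains(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four independent dependency chains. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    /* Remaining 0-3 elements go into the first accumulator. */
    for (; i < n; i++) {
        s0 += x[i];
    }
    /* Combine the partial sums at the end. */
    return (s0 + s1) + (s2 + s3);
}
```

Because floating point addition is not associative, the result can differ in the last bits from a strictly sequential sum.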

Almost all instructions that generate more than one micro-op are microcoded. The performance of microcode is not good. All microcoded instructions take 7 clock cycles or more. This includes most of the legacy x87 floating point instructions. You should avoid legacy x87 code. Floating point division is also relatively slow (32 clock cycles for a vector division).

The instruction decoder is likely to be a bottleneck. It can decode a maximum of two instructions or 16 bytes of code per clock cycle.

When AVX was introduced with 256-bit vector registers, we were told to use the instruction VZEROUPPER to avoid a severe penalty when switching between VEX and non-VEX code. Four generations of Intel processors had such a penalty (Sandy Bridge, Ivy Bridge, Haswell, and Broadwell). AMD processors and later Intel processors (Skylake and Knights Landing) do not have such a state switch. They have no need for the VZEROUPPER. Unfortunately, the VZEROUPPER is quite costly on Knights Landing. The recommendations from Intel are conflicting here. The Intel optimization manual recommends VZEROUPPER when switching between AVX and SSE code, but elsewhere in the same manual it says that you should not use VZEROUPPER on Knights Landing. This conflict is currently not resolved (see my discussion in Intel's developer zone).

I am somewhat sceptical about the extensive use of hyperthreading - Intel's word for running multiple threads in the same core. What is the point of running four threads in a CPU core with a limited bandwidth of two instructions per clock cycle? This wouldn't be useful for CPU intensive code, but perhaps for code that is limited by memory access, branch mispredictions, or long dependency chains. Hyperthreading has a hazard that is often ignored. If four threads are running in the same core then each thread gets only a quarter of the CPU resources. I have seen a high-priority thread running at quarter speed because three other low priority threads were running in the same core. This is certainly not an optimal use of resources, and current operating systems are unable to avoid this problem. There is little you can do in a multi-user or multi-process system to prevent low priority threads from stealing resources from high priority threads. It may actually be better to turn off hyperthreading completely in the BIOS setup. There is also a security issue here: One thread will be able to detect what kind of code is running in another thread in the same core by detecting which CPU resources are fully used and which ones are unused.

My optimization manuals have been updated with test results and instruction timings for the Knights Landing and some more general information about AVX512 (link).

My assembly function library has been updated with memcpy, memmove, memset, and memcmp functions optimized for AVX512 (link).

My vector class library has been updated with improved support for AVX512 (link).

Author: Søren Egmose	Date: 2016-11-27 03:13
Considering that OpenPower already supports 8 hardware threads per core this is apparently an acceptable way to go. This also opens a new approach to multithreading where you can have the threads on a single core cooperate to solve the problems as a small pack.

Author: Joe Duarte	Date: 2016-12-03 23:57
Agner, what's the latency of the MCDRAM when used as main memory?

Author: Agner	Date: 2016-12-04 23:37
Joe Duarte wrote: "what's the latency of the MCDRAM when used as main memory?"
Approximately 200 clock cycles, I think.

Author: Constantinos Evangelinos	Date: 2016-12-05 16:59
Intel gives 150ns for MCDRAM and 125ns for main memory.

Author: Agner	Date: 2016-12-06 12:20
My measurements of memory latencies are higher. However, I don't have access to control the memory configuration (it is not my machine). My measurements use random memory addresses to avoid prefetching.
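For readers unfamiliar with the technique, measuring latency with random addresses is commonly done by pointer chasing: each load address depends on the result of the previous load, and the addresses follow a random order that a hardware prefetcher cannot predict. The following C sketch shows that general idea only. It is not the test code used for the measurements above, and the function names build_random_cycle and chase are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fills next[0..n-1] with a random permutation that forms one single
   cycle (Sattolo's algorithm), so that following next[] visits every
   slot exactly once before returning to the start.
   Note: rand() is used for brevity; its range and quality are limited,
   so a better generator would be preferable for very large arrays. */
void build_random_cycle(size_t *next, size_t n, unsigned seed)
{
    assert(next != NULL && n >= 2);
    srand(seed);
    for (size_t i = 0; i < n; i++) {
        next[i] = i;
    }
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* 0 <= j < i, never j == i */
        size_t t = next[i];
        next[i] = next[j];
        next[j] = t;
    }
}

/* Performs 'steps' dependent loads starting at 'start' and returns the
   final index. Each load must wait for the previous one, so the time
   per step approximates the load latency. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    for (size_t k = 0; k < steps; k++) {
        p = next[p];
    }
    return p;
}
```

To turn this into a measurement, one would time a large number of steps in chase on an array much bigger than the caches and divide the elapsed time by the step count. The returned index should be used afterwards so that the compiler cannot remove the loop. A more careful version would also pad each element to a full cache line.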

Author: Joe Duarte	Date: 2016-12-07 15:44
Interesting. Why are we not getting lower latency from these integrated memory modules? They're closer to the processor than DIMM-mounted DRAM, yet we never seem to reap any latency reductions. I'm thinking not just of MCDRAM, but also HBM2 and smartphone SoCs.

Author: Agner	Date: 2017-07-13 09:58
That's right. A 'v' in my instruction tables represents a vector register of any size. Intel's Software Developer's Manual tells which intrinsics correspond to which instructions.

Author: Agner	Date: 2017-10-10 04:05
Peter Cordes wrote: "Everything except your table says 1c throughput for INC/DEC, not 0.5c, for both KNL and Silvermont."
You are right. It will be corrected in the next update. NEG and NOT have double throughput, but not INC and DEC.