Agner's CPU blog


 
thread Test results for Knights Landing - Agner - 2016-11-26
reply Test results for Knights Landing - Nathan Kurz - 2016-11-26
replythread Test results for Knights Landing - Tom Forsyth - 2016-11-27
reply Test results for Knights Landing - Søren Egmose - 2016-11-27
last reply Test results for Knights Landing - Agner - 2016-11-30
replythread Test results for Knights Landing - Joe Duarte - 2016-12-03
replythread Test results for Knights Landing - Agner - 2016-12-04
last reply Test results for Knights Landing - Constantinos Evangelinos - 2016-12-05
last replythread Test results for Knights Landing - John McCalpin - 2016-12-06
replythread Test results for Knights Landing - Agner - 2016-12-06
last reply Test results for Knights Landing - John McCalpin - 2016-12-08
last reply Test results for Knights Landing - Joe Duarte - 2016-12-07
replythread Test results for Knights Landing - zboson - 2016-12-28
last reply VZEROUPPER - Agner - 2016-12-28
replythread Test results for Knights Landing - Ioan Hadade - 2017-07-13
last reply Test results for Knights Landing - Agner - 2017-07-13
last replythread INC/DEC throughput - Peter Cordes - 2017-10-09
last reply INC/DEC throughput - Agner - 2017-10-10
 
Test results for Knights Landing
Author: Agner Date: 2016-11-26 08:39
The Knights Landing is Intel's new "Many Integrated Core" processor. It has 64-72 cores that can run four threads each. It is built with a 14 nm process and runs at a clock frequency of 1.3-1.5 GHz. It is intended for processing large data sets in parallel. It is only useful, of course, if the calculations can easily be split up into multiple threads that can run in parallel.


Each core is lightweight, based on an extension of the Silvermont low-power architecture. Each core runs slower than a desktop CPU, but with a large number of cores we can still get high overall performance. Each core has 32 kB of level-1 code cache and 32 kB of level-1 data cache; each pair of cores shares 1 MB of level-2 cache; and the package contains 16 GB of MCDRAM. The MCDRAM can be configured as a level-3 cache or as main memory.


The predecessor, Knights Corner, was not very impressive, and it had its own instruction set. The Knights Landing is the first processor with the new AVX512 instruction set. AVX512 is expected to become the standard for future x86 processors, so that the Knights Landing will be binary compatible with mainstream microprocessors. It also supports the previous instruction sets, AVX2, etc.


The AVX512 instruction set seems to be quite efficient. It has 32 vector registers of 512 bits each, where AVX2 has only 16 registers of 256 bits each. It also has a new set of eight mask registers that can be used for conditional execution of each element of a vector. Almost all vector instructions can be masked. This works quite efficiently. The latencies and throughputs of vector instructions are the same with or without a mask, and independent of the value of the mask register. The Gnu compiler optimizes this quite well, so that, for example, an addition and an if can be merged into a single add instruction with a mask.
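A minimal sketch of what this looks like in assembly, for something like if (b[i] > 0) a[i] += b[i]; (register choices are mine; predicate 14 means greater-than):

vpxord zmm3,zmm3,zmm3 ; zmm3 = 0
vcmpps k1,zmm1,zmm3,14 ; k1[i] = (b[i] > 0)
vaddps zmm0 {k1},zmm0,zmm1 ; a[i] += b[i] only where k1[i] is set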


On the positive side, the Knights Landing has true out-of-order processing (unlike Knights Corner and Silvermont). It has good memory throughput. It can do two 512-bit vector reads, or one read and one write, per clock cycle. The throughput for simple vector instructions is two 512-bit vectors per clock.


The Knights Landing has an instruction set extension, AVX512ER, with some quite impressive math instructions. It can calculate a reciprocal, a reciprocal square root, and even an exponential function, on a vector of 16 floats in just a few clock cycles. The manual has a somewhat confusing description of the accuracy of these instructions. My measurements showed that these instructions are accurate to the last bit on single precision floats, while they give only approximate results for double precision. These instructions are useful for neural networks and other applications with large amounts of low-precision math.
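For illustration, these are the three instructions in question, each operating on a vector of 16 single-precision floats (a sketch; register choices are mine):

vrcp28ps zmm0,zmm1 ; 16 reciprocals, 1/x
vrsqrt28ps zmm2,zmm3 ; 16 reciprocal square roots, 1/sqrt(x)
vexp2ps zmm4,zmm5 ; 16 base-2 exponentials, 2^x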


On the negative side, all vector instructions have a latency of at least 2 clock cycles, where earlier processors have a latency of 1 for simple vector instructions. Integer instructions on general purpose registers have a latency of 1. A possible explanation for this difference is that the integer reservation station can hold source data, while the floating point reservation station cannot. This means that an integer ALU can write its result directly to any subsequent micro-op in the reservation station that needs it, while results in the floating point unit have to go via the floating point register file. The size of the vector operands is simply too large to make it practical to store the values in the reservation station.


Almost all instructions that generate more than one micro-op are microcoded. The performance of microcode is not good. All microcoded instructions take 7 clock cycles or more. This includes most of the legacy x87 floating point instructions. You should avoid legacy x87 code. Floating point division is also relatively slow (32 clock cycles for a vector division).
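For example, a scalar double-precision addition in legacy x87 code versus the equivalent VEX-encoded code (a sketch; a, b, and c are hypothetical memory operands):

; legacy x87 - most x87 instructions are microcoded on Knights Landing:
fld qword [a]
fadd qword [b]
fstp qword [c]

; preferred scalar SSE2/AVX equivalent:
vmovsd xmm0,[a]
vaddsd xmm0,xmm0,[b]
vmovsd [c],xmm0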


The instruction decoder is likely to be a bottleneck. It can decode a maximum of two instructions or 16 bytes of code per clock cycle.


When AVX was introduced with 256-bit vector registers, we were told to use the instruction VZEROUPPER to avoid a severe penalty when switching between VEX and non-VEX code. Four generations of Intel processors had such a penalty (Sandy Bridge, Ivy Bridge, Haswell, and Broadwell). AMD processors and later Intel processors (Skylake and Knights Landing) do not have such a state switch. They have no need for VZEROUPPER. Unfortunately, VZEROUPPER is quite costly on Knights Landing. The recommendations from Intel are conflicting here. The Intel optimization manual recommends VZEROUPPER when switching between AVX and SSE code, but elsewhere in the same manual they say that you should not use VZEROUPPER on Knights Landing. This conflict is currently not resolved (see my discussion in Intel's Developer Zone).
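For reference, the pattern recommended for the older processors looks like this (a sketch; the callee name is hypothetical):

vmulps ymm0,ymm1,ymm2 ; 256-bit VEX code leaves the upper register state dirty
vzeroupper ; cheap on Sandy Bridge through Broadwell, but costly on Knights Landing
call sse_function ; hypothetical function containing non-VEX (SSE) code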


I am somewhat sceptical about the extensive use of hyperthreading - Intel's word for running multiple threads in the same core. What is the point of running four threads in a CPU core with a limited bandwidth of two instructions per clock cycle? This wouldn't be useful for CPU-intensive code, but perhaps for code that is limited by memory access, branch mispredictions, or long dependency chains. Hyperthreading has a hazard that is often ignored. If four threads are running in the same core, then each thread gets only a quarter of the CPU resources. I have seen a high-priority thread running at quarter speed because three other low-priority threads were running in the same core. This is certainly not an optimal use of resources, and current operating systems are unable to avoid this problem. There is little you can do in a multi-user or multi-process system to prevent low-priority threads from stealing resources from high-priority threads. It may actually be better to turn off hyperthreading completely in the BIOS setup. There is also a security issue here: one thread can detect what kind of code is running in another thread in the same core by detecting which CPU resources are fully used and which ones are unused.


My optimization manuals have been updated with test results and instruction timings for the Knights Landing and some more general information about AVX512 (link).


My assembly function library has been updated with memcpy, memmove, memset, and memcmp functions optimized for AVX512 (link).


My vector class library has been updated with improved support for AVX512 (link).

   
Test results for Knights Landing
Author: Nathan Kurz Date: 2016-11-26 19:07
Thanks for publishing this! Some comments and questions:

> The Knights Landing has full out-of-order capabilities

It may be worth mentioning that the Memory Execution Cluster is still in order. The Intel Architectures Optimization Reference Manual says "The MEC has limited capability in executing uops out-of-order. Specifically, memory uops are dispatched from the scheduler in-order, but can complete in any order. " More specifically, they mean "in order with regard to other memory operations": unlike other modern Xeons, a memory operation with an unfulfilled dependency will block successive memory operations from being dispatched even if their dependencies are ready.

> The reservation stations have 2x12 entries for the integer lines, 2x20 entries
> for the floating point and vector lines, and 12 for the memory lines.

The manual says "The single MEC reservation station has 12 entries, and dispatches up to 2 uops per cycle." I think this means that both 'memory lines' share a single 12 slot queue this is used for both scalar and vector operations? If so, might be good to be more explicit.

> The Knights Landing has two decoders. The maximum throughput is two
> instructions or 16 bytes per clock cycle.

The manual says "The front end can fetch 16 bytes of instructions per cycle. The decoders can decode up to two instructions of not more than 24 bytes in a cycle. ", and then later says "The total length of the instruction bytes that can be decoded each cycle is at most 16 bytes per cycle with instructions not more than 8 bytes in length. For instruction length exceeding 8 bytes, only one instruction per cycle is decoded on decoder 0." I haven't figured out when the 24B limit would apply. Have you? Also, is it correct that these limits are not affected by alignment?

> The throughput is limited to two instructions per clock cycle
> in the decode and register rename stages.
> the average throughput is limited to two μops per clock cycle.
> Read-modify, and read-modify-write instructions generate a single μop
> from the decoders, which is sent to both the memory unit and the execution unit.

By emphasizing that it's a 'single µop', do you mean that the same µop is first sent to the memory unit and then, when the data is ready, (Figure 16-2) sent to the Integer or FP rename buffer as appropriate? And thus, since the renamer can handle only two instructions per cycle, this means that unlike other Xeons the use of 'read-modify' (aka 'load-op') instructions does not usually help to increase µop throughput?

> There is no penalty for the 2- and 3-byte VEX prefixes and 4-byte EVEX prefixes
> unless there are additional prefixes before these.

It's fairly easy to check that all the instructions are encoded to less than 8B, but I'm not sure I know how to correctly count the prefixes. Do you have any examples of common cases where this would be a problem?

> The Knights Landing has no loop buffer, unlike the Silvermont. This means
> that the decoding of instructions is a very likely bottleneck, even for small loops.

The corollary to this is that loop unrolling can be a win on KNL even when it would be counterproductive on a Xeon with a loop buffer.
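For instance, a two-way unrolled copy loop halves the loop-control overhead per vector processed (a sketch, assuming aligned pointers in rsi and rdi and a block count in rcx):

copyloop:
vmovaps zmm0,[rsi] ; first 64-byte vector
vmovaps zmm1,[rsi+64] ; second vector in the same iteration
vmovaps [rdi],zmm0
vmovaps [rdi+64],zmm1
add rsi,128
add rdi,128
sub rcx,1
jnz copyloop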

> These are forked after the register allocate and renaming stages into two the integer unit

I think there is a missing word after 'two'?

> A 64-bit register can be cleared by xor'ing the corresponding 32-bit register with itself.

Hmm, interesting find. Do you see any reason they would choose to recognize the 32-bit idiom but not the 64-bit? Or just oversight?

> The processor can do two memory reads per clock cycle or one read and
> one write with vector registers of up to 512 bits. It cannot do two reads with
> general purpose registers in the same clock cycle, but it can do one
> read and one write.

I can't tell from this (or the Intel manual) which "vector plus scalar" memory operations are possible in the same cycle. Do you know if it can read a vector and a scalar in the same cycle? Read one and write the other?

> The latency from the mask register, k1, to the destination is 2 clock cycles.
> The latency from zmm1 input to zmm1 output is 2 clock cycles.

I'm not understanding what you are measuring here. Are you saying that if you have an instruction that changes the mask or the input, immediately followed by a masked read, then you will have an effective load latency of 2 + 5? Or something else?

> The {z} option makes the elements in the destination zero, rather than
> unchanged, when the corresponding mask bit is zero. It is recommended
> to use the zeroing option when possible to avoid the dependence on the previous value.

Does this dependency occur only when explicitly using a mask, or does it matter in other cases too? That is, does {z} make a difference without a mask?

> The best solution in such cases may be to turn off hyper-threading in the BIOS setup.

Do you think there is a difference between turning off hyper-threading in BIOS, rather than telling the OS to disable some of the cores at runtime? The manual says "Hard partitioning of resources changes as logical processors wake up and go to sleep", which makes me think that OS level would be fine, although I'm not sure exactly what it means by 'sleep'.

> 15.11 Bottlenecks in Knights Landing

As shown in your Instruction Table, the extreme cost of some of the "byte oriented" vector instructions might be worth calling out. The ones that stood out for me were VPSHUFB y,y,y with latency 23 and reciprocal throughput 12 (versus 1 and 1 for Skylake) and PMOVMSKB r32,y with latency 26 and reciprocal throughput 12 (versus 2-3 and 1 for Skylake). Specific mention of some of these might be helpful, since while these are still technically supported, it's unlikely that you'd want to use an algorithm that depends on them.

There are a couple of passages in the manual that I haven't been able to decipher. Do you perhaps understand this one, which (among other things) pertains to shuffles and permutes? "Some execution units in the VPU may incur scheduling delay if a sequence of dependent uop flow needs to use these execution units, these outlier units are indicated by the footnote of Table 16-2. When this happens, it will have an additional cost of a 2-cycle bubble."

The other line that scares me in the manual is this one: "The decoder will also have a small delay if a taken branch is encountered." Did you happen to figure out how long this delay is?

Thanks again for making this great research available! I'm sure a lot of people will greatly appreciate it.

   
Test results for Knights Landing
Author: Tom Forsyth Date: 2016-11-27 00:45
Looks like a typo in the KNL recip-throughput number for FMA - it's currently 3. KNF and KNC get 1, and this chip is a real FMA machine - it's designed around that unit. Pretty sure the correct number for KNL is 0.5 (like VADDPS and VMULPS).

As for why the chip has 4 threads per core - I'm the guy that persuaded KNF (and thus KNC) to have 4 threads, and the reason is first to hide memory misses, second branch mispredicts, and third instruction latencies. Those are all huge bottlenecks in real-world performance. Yes, you can also hide them with huge OOO machines, wide decoders, and long pipelines, but when flops/watt is your efficiency metric, those aren't the first choice. The Knights line of chips already opens the book with "we assume you have 70+ threads of stuff to do", and while getting from 1 thread to 4 is agony, and 4 threads to 16 is hard, getting from 70 threads to 280 is actually pretty simple.

   
Test results for Knights Landing
Author: Søren Egmose Date: 2016-11-27 03:13
Considering that OpenPower already supports 8 hardware threads per core, this is apparently an acceptable way to go. This also opens a new approach to multithreading, where the threads on a single core can cooperate to solve a problem as a small pack.
   
Test results for Knights Landing
Author: Agner Date: 2016-11-30 04:33
Tom Forsyth wrote:
> Looks like a typo in the KNL recip-throughput number for FMA - it's currently 3. KNF and KNC get 1, and this chip is a real FMA machine - it's designed around that unit. Pretty sure the correct number for KNL is 0.5 (like VADDPS and VMULPS).

You are right. My mistake. Fused multiply-and-add has a throughput of two instructions per clock.

> As for why the chip has 4 threads per core - I'm the guy that persuaded KNF (and thus KNC) to have 4 threads, and the reason is first to hide memory misses, second branch mispredicts, and third instruction latencies. Those are all huge bottlenecks in real-world performance. Yes, you can also hide them with huge OOO machines, wide decoders, and long pipelines, but when flops/watt is your efficiency metric, those aren't the first choice.

Thanks for clarifying the reason. Running 4 threads in an in-order core makes sense. Do you think that 4 threads are still useful in the out-of-order KNL?

Nathan Kurz wrote:
> The Knights Landing has full out-of-order capabilities
>
> It may be worth mentioning that the Memory Execution Cluster is still in order. The Intel Architectures Optimization Reference Manual says "The MEC has limited capability in executing uops out-of-order. Specifically, memory uops are dispatched from the scheduler in-order, but can complete in any order." More specifically, they mean "in order with regard to other memory operations":

Memory operations are scheduled in order but executed out of order, as I understand it.

> The manual says "The front end can fetch 16 bytes of instructions per cycle. The decoders can decode up to two instructions of not more than 24 bytes in a cycle.", and then later says "The total length of the instruction bytes that can be decoded each cycle is at most 16 bytes per cycle with instructions not more than 8 bytes in length. For instruction length exceeding 8 bytes, only one instruction per cycle is decoded on decoder 0." I haven't figured out when the 24B limit would apply. Have you? Also, is it correct that these limits are not affected by alignment?

I just tried. A block of two instructions of 24 bytes total can decode in a single clock cycle, but you cannot have consecutive blocks exceeding 16 bytes, and the average cannot exceed 16 bytes per clock. And yes, alignment matters. Decoding is most efficient when aligned by 16. There is probably a double buffer of 2x16 bytes.

> Read-modify, and read-modify-write instructions generate a single μop from the decoders, which is sent to both the memory unit and the execution unit.
>
> By emphasizing that it's a 'single µop', do you mean that the same µop is first sent to the memory unit and then, when the data is ready, sent to the Integer or FP rename buffer as appropriate?

Yes, that's how I understand it.

> There is no penalty for the 2- and 3-byte VEX prefixes and 4-byte EVEX prefixes unless there are additional prefixes before these.
>
> It's fairly easy to check that all the instructions are encoded to less than 8B, but I'm not sure I know how to correctly count the prefixes. Do you have any examples of common cases where this would be a problem?

You do not normally need any additional prefixes in front of VEX and EVEX prefixes. The only case is FS and GS segment prefixes for the thread environment blocks, and these blocks are usually accessed with integer instructions. But instructions without VEX and EVEX prefixes can be a mess of prefixes. Legacy SSSE3 instructions without a VEX prefix all have a meaningless 66H prefix and a 2-byte escape code (0FH, 38H). An additional REX prefix is needed if one of the registers r8-r15 or xmm8-xmm15 is used. This gives a total of 4 prefix and escape bytes, which is more than the decoder can handle in a single clock cycle. Another example is the ADCX and ADOX instructions with 64-bit registers.
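To illustrate the counting (my reading of the standard encodings; the byte values are for these particular register choices):

pshufb xmm8,xmm9 ; 66 45 0F 38 00 C1 - 66 prefix + REX + 0F 38 escape = 4 prefix/escape bytes
vpshufb xmm8,xmm8,xmm9 ; C4 42 39 00 C1 - a single 3-byte VEX prefix replaces them all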

> A 64-bit register can be cleared by xor'ing the corresponding 32-bit register with itself.
>
> Hmm, interesting find. Do you see any reason they would choose to recognize the 32-bit idiom but not the 64-bit? Or just oversight?

An optimizing compiler would use xor eax,eax rather than xor rax,rax because the former is shorter. But it may be an oversight. There is no difference in length between xor r8,r8 and xor r8d,r8d.
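In other words:

xor eax,eax ; recognized zeroing idiom; also clears the upper half of rax
xor rax,rax ; same architectural effect, but not recognized as independent of the old value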

> I can't tell from this (or the Intel manual) which "vector plus scalar" memory operations are possible in the same cycle. Do you know if it can read a vector and a scalar in the same cycle?

Yes, it can do a vector read and a g.p. register read in the same clock cycle. It can also do any combination of a read and a write.

> The latency from the mask register, k1, to the destination is 2 clock cycles. The latency from zmm1 input to zmm1 output is 2 clock cycles.
>
> I'm not understanding what you are measuring here. Are you saying that if you have an instruction that changes the mask or the input, immediately followed by a masked read, that you will have an effective load latency of 2 + 5? Or something else?

The latencies come in parallel, not in series, so the latency from a mask or destination register will be 2 if the memory operand is ready.

> The {z} option makes the elements in the destination zero, rather than unchanged, when the corresponding mask bit is zero. It is recommended to use the zeroing option when possible to avoid the dependence on the previous value.
>
> Does this dependency occur only when explicitly using a mask, or does it matter in other cases too? That is, does {z} make a difference without a mask?

A masked instruction has a dependency on the destination register if there is a mask and no {z}. There is no dependency on the destination register if there is no mask, or if there is a mask with {z}. It doesn't make sense to put {z} when there is no mask.
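To spell out the three cases (a sketch):

vaddps zmm0 {k1}{z},zmm1,zmm2 ; zeroing: no dependence on the old value of zmm0
vaddps zmm0 {k1},zmm1,zmm2 ; merging: must wait for the previous value of zmm0
vaddps zmm0,zmm1,zmm2 ; no mask: no dependence on the old value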

> The best solution in such cases may be to turn off hyper-threading in the BIOS setup.
>
> Do you think there is a difference between turning off hyper-threading in BIOS, rather than telling the OS to disable some of the cores at runtime?

There is no difference. The problem is that current operating systems are not handling hyper-threading optimally. In fact, the OS lacks the necessary information. The CPUID instruction can tell how many threads are sharing the same L1 or L2 cache, but it does not tell which threads are sharing a cache, and it does not tell whether threads are sharing decoders, execution units, and other resources. The proper solution would be to make the CPUID instruction tell which resources are shared between which threads, and to make the operating system give high-priority threads an unshared core of their own. But still, the OS does not know what the bottleneck is in each thread. It may be OK to put two threads in the same core if the bottleneck is memory access, but not if the bottleneck is instruction decoding. The application programmer or the end user cannot control this either if programs are running in a multi-user system. The hardware designers have given an impossible task to the software developers.

> Do you perhaps understand this one, which (among other things) pertains to shuffles and permutes? "Some execution units in the VPU may incur scheduling delay if a sequence of dependent uop flow needs to use these execution units, these outlier units are indicated by the footnote of Table 16-2. When this happens, it will have an additional cost of a 2-cycle bubble"

There is an extra latency of 1-2 clock cycles when, for example, the output of an addition instruction goes to the input of a shuffle instruction. The data have to travel a longer distance to these units that they call outliers. There is less latency when the output of a shuffle instruction goes to the input of another shuffle instruction.

> The decoder will also have a small delay if a taken branch is encountered. Did you happen to figure out how long this delay is?

A taken branch has a delay of 2 clock cycles.
   
Test results for Knights Landing
Author: Joe Duarte Date: 2016-12-03 23:57
Agner, what's the latency of the MCDRAM when used as main memory?
   
Test results for Knights Landing
Author: Agner Date: 2016-12-04 23:37
Joe Duarte wrote:
> what's the latency of the MCDRAM when used as main memory?
Approximately 200 clock cycles, I think.
   
Test results for Knights Landing
Author: Constantinos Evangelinos Date: 2016-12-05 16:59
Intel gives 150ns for MCDRAM and 125ns for main memory.
   
Test results for Knights Landing
Author: John McCalpin Date: 2016-12-06 11:19
The Xeon Phi x200 (Knights Landing) has a lot of modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode.

It is also important to remember that the latency can be different for each physical address, depending on the location of the requesting core, the location of the coherence agent responsible for that address, and the location of the memory controller for that address. Intel has not publicly disclosed the mapping of core numbers (APIC IDs) to physical locations on the chip or the locations of coherence agents (CHA boxes) on the chip, nor has it disclosed the hash functions used to map physical addresses to coherence agents and to map physical addresses to MCDRAM or DDR4 memory controllers. (In some modes of operation the memory mappings are trivial, but not in all modes.)

The Knights Landing system at TACC uses the Xeon Phi 7250 processor (68 cores, 1.4 GHz nominal). For operation in "Flat" mode (MCDRAM as memory, located in the upper 16 GiB of the physical address space), with the coherence agent mapping in "Quadrant" mode (addresses are hashed to coherence agents spread across the entire chip, but each cache line is assigned to an MCDRAM controller in the same "quadrant" as the CHA responsible for coherence), my preferred latency tester gives values of 154ns +/- 1ns (1 standard deviation) for MCDRAM. These values are averaged over many addresses, with the variation mostly from core to core (with a few ns of random variability). My latency tester uses permutations of even-numbered cache lines in various sized address range blocks, so it is not guaranteed that my averages are uniformly distributed over all the coherence agents.

For the same system in "Flat" "All-to-All" mode (addresses are hashed to coherence agents spread across the entire chip, with no special correlation between the location of coherence agents and the MCDRAM controller owning an address), the corresponding value is 156ns +/- 1ns (1 standard deviation).

For the same system in "Flat" "Sub-NUMA Cluster 4" mode, the corresponding values are 150.5ns +/- 0.9ns (1 standard deviation) for "local" accesses, and 156.8ns +/- 3.1ns for "remote" accesses. Variability across nodes is not entirely negligible, in part because different nodes have different patterns of disabled tiles. (Four of the 38 tiles are disabled on each Xeon Phi 7250 processor.) Run-to-run variability is typically small when using large pages, but there are certain idiosyncrasies that have yet to be explained.

Note that even though the average latency differences are quite small across these modes of operation, the sustained bandwidth differences are much larger. The decreased number of "hops" required for coherence transactions in "Quadrant" and "SNC-4" modes reduces contention on the mesh links and thereby allows higher sustained bandwidths. The difference between sustained bandwidth in Flat-All-to-All and Flat-Quadrant modes suggests that contention on the non-data mesh links (address, acknowledge, and invalidate) is more important than contention on the data transfer links (which should be the same for those two modes of operation). I will post more details to my blog as they become available....

   
Test results for Knights Landing
Author: Agner Date: 2016-12-06 12:20
My measurements of memory latencies are higher. However, I don't have access to control the memory configuration (it is not my machine). My measurements use random memory addresses to avoid prefetching.
   
Test results for Knights Landing
Author: John McCalpin Date: 2016-12-08 12:21
I have done testing with random permutations and with the hardware prefetchers disabled (and with both at the same time), and the simple stride results with no HW PF match the permuted results with HW PF enabled once the permutation block gets big enough. I did these tests back in July, and we have changed a number of aspects of the system configuration since then, but I think that Transparent Huge Pages were enabled when I did these tests. I don't recall if this was before or after we disabled some of the C states. The "untile" frequency may also make a difference -- it automatically ramps up to full speed when running bandwidth tests, but when running latency tests the Power Control Unit may not think that the "untile" is busy enough to justify ramping up the frequency.

Without knowledge of the tag directory hash, the processor placement, the MCDRAM hash, etc, it is challenging to make a lot of sense of the results. On KNC the RDTSC instruction had about a 5 cycle latency, so I was able to do a lot more with timing individual loads, and the single-ring topology made the analysis easier.

There are more performance counters in the "untile" on KNL, but there is no documentation on where the various boxes are located on the mesh. There is some evidence that the CHA box numbers are re-mapped -- on a 32-tile/64-core Xeon Phi 7210 all 38 CHAs are active, but the six CHAs with anomalous readings are numbered 32-37. The missing core APIC IDs are not bunched up in this way.

The stacked memory modules have slightly higher latency because they are typically run in "closed page" mode, and because there is an extra set of chip-to-chip crossings. HMC (and Intel's MCDRAM) have an extra SERDES step between the memory stack and the processor chip. There are many different approaches to error-checking on SERDES links, but it is probably safe to expect that error-checking will require at least some added latency.

   
Test results for Knights Landing
Author: Joe Duarte Date: 2016-12-07 15:44
Interesting. Why are we not getting lower latency from these integrated memory modules? They're closer to the processor than DIMM-mounted DRAM, yet we never seem to reap any latency reductions. I'm thinking not just of MCDRAM, but also HBM2 and smartphone SoCs.
   
Test results for Knights Landing
Author: zboson Date: 2016-12-28 02:44
I'm confused by your statement:
"AMD processors and later Intel processors (Skylake and Knights Landing) do not have such a state switch. They have no need for the VZEROUPPER"

From this discussion

stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake

it appears that VZEROUPPER is even more important in some situations with Skylake than, e.g., with Haswell.

   
VZEROUPPER
Author: Agner Date: 2016-12-28 03:57
zboson wrote:
> it appears that VZEROUPPER is even more important in some situations with Skylake than, e.g., with Haswell.


I just made some experiments on a Haswell. The dirty state makes every non-VEX vector operation dependent on the previous value of the destination register. For example

vpor ymm1,ymm2,ymm3 ; 256-bit VEX instruction leaves the upper register state dirty
...
pxor xmm4,xmm4 ; non-VEX: has false dependence on ymm4

I have no access to a Skylake at the moment, but it may behave the same. This means that VZEROUPPER is still needed on all Intel processors, except Knights Landing.

   
Test results for Knights Landing
Author: Ioan Hadade Date: 2017-07-13 04:00
As always, many thanks for your work with the instruction timings and evaluations, they are truly helpful.

I have one question regarding the permutations on KNL. As you can see in your instruction manual, and corroborated in "Intel Xeon Phi Processor High Performance Programming", there is around a 2X difference in throughput and 33% in latency between a two-source and a one-source permutation/shuffle, with the one-source variant being superior. I am interested in permutations on double-precision operands, to be more precise.

Scanning through your instruction table, I see that you have added vpermpd as only working on ymm registers, which presumably is a one-source permutation. However, in the intrinsics guide, I can see that you can generate a vpermpd using zmm's with _mm512_mask_permutex_pd, where you can use a mask to keep results from the source argument (which I think gets overwritten) and shuffle across 256-bit lanes in the other operand. This decays to vpermpd zmm {k}, zmm, imm. As far as I can tell, this is a one-source-operand permutation which will have a latency between 3-6 cycles and a reciprocal throughput of 1, am I correct? From my understanding, the two-source-operand forms are the intrinsics that decay into vpermpd zmm {k}, zmm, zmm, such as _mm512_maskz_permutexvar_pd, or the VSHUFF64X2 instructions.

Your clarification between these would be very helpful.

Kind regards,
Ioan

   
Test results for Knights Landing
Author: Agner Date: 2017-07-13 09:58
That's right.
A 'v' in my instruction tables represents a vector register of any size.
Intel's Software Developer’s Manual tells which intrinsics correspond to which instructions.
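To spell out the two forms discussed above (a sketch; register choices are mine):

vpermpd zmm0 {k1},zmm1,0x4E ; one source + immediate: the faster form, permuting within each 256-bit lane
vpermpd zmm0 {k1},zmm2,zmm1 ; indices in zmm2 select elements of zmm1: the two-source form, slower on KNL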
   
INC/DEC throughput
Author: Peter Cordes Date: 2017-10-09 23:52
Everything except your table says 1c throughput for INC/DEC, not 0.5c, for both KNL and Silvermont, and Intel's manual explains why.

InstLatx64 for Silvermont says 1 per clock INC r32 throughput, but 0.5c ADD r32, imm8. Same for Goldmont (except that ADD r32,imm8 is 0.33c).

I haven't found InstLatx64 results for actual KNL, but Intel's optimization manual describes KNL as having the same flag-merging extra uop for INC/DEC as Silvermont, so I expect that INC is worth avoiding on KNL as well when you're bottlenecked on issue throughput instead of decode. (Is there any IDQ between decode and issue to let the expansion from 1 uop decoded / 2 uops issued fill decode bubbles?) If so, it could still have an apparent cost of 0.5c in the right circumstances.

According to Intel's optimization manual, 17.1.2 Out-of-Order Engine (KNL):

Additionally, some instructions in the Knights Landing microarchitecture will be decoded as one uop by
the front end but need to expand to two operations for execution. These complex uops will have an allocation
throughput of one per cycle. Examples of these instructions are:


  • POP: integer load data + ESP update, PUSH: integer store data + ESP update.
  • INC/DEC: add to register + update partial flags
  • Gather: two VPU uops
  • CALL / RET: JMP + ESP update
  • LEA with 3 sources

(lightly edited the list to group related instructions better.)

This means there's *always* a flag-merging uop, even when nothing reads INC's flags. And it explains why the sustained throughput of INC/DEC is only 1, not 0.5, according to measurements other than yours, and according to Intel's published tables. It's nice that the integer register update itself doesn't have a false dep on flags, though, so they made it a lot less bad than P4.
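So when allocation is the bottleneck, the substitution is straightforward (a sketch based on the excerpt above):

inc eax ; one uop from the decoders, but expands to add + flag-merge at allocation
add eax,1 ; a single op all the way through; differs only in that it also writes CF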

   
INC/DEC throughput
Author: Agner Date: 2017-10-10 04:05
Peter Cordes wrote:
> Everything except your table says 1c throughput for INC/DEC, not 0.5c, for both KNL and Silvermont
You are right. It will be corrected in the next update. NEG and NOT have double throughput, but not INC and DEC.