Agner's CPU blog


 
thread Test results for Broadwell and Skylake - Agner - 2015-12-26
replythread Sustained 64B loads per cycle on Haswell & Sky - Nathan Kurz - 2015-12-26
last replythread Sustained 64B loads per cycle on Haswell & Sky - Agner - 2015-12-27
last replythread Sustained 64B loads per cycle on Haswell & Sky - Nathan Kurz - 2015-12-27
reply Sustained 64B loads per cycle on Haswell & Sky - John D. McCalpin - 2016-01-04
reply Sustained 64B loads per cycle on Haswell & Sky - T - 2016-06-18
last reply Sustained 64B loads per cycle on Haswell & Sky - Jens Nurmann - 2017-01-12
replythread Test results for Broadwell and Skylake - Peter Cordes - 2015-12-28
last reply Test results for Broadwell and Skylake - Agner - 2015-12-29
replythread Test results for Broadwell and Skylake - Tacit Murky - 2016-01-04
last replythread Test results for Broadwell and Skylake - Agner - 2016-01-05
last replythread Test results for Broadwell and Skylake - Tacit Murky - 2016-03-09
last reply Test results for Broadwell and Skylake - Tacit Murky - 2016-06-05
replythread Minor bug in the microarchitecture manual - SHK - 2016-01-10
last reply Minor bug in the microarchitecture manual - Agner - 2016-01-16
replythread Test results for Broadwell and Skylake - John D. McCalpin - 2016-01-12
last replythread Test results for Broadwell and Skylake - Jess - 2016-02-11
last reply Description of discrepancy - Nathan Kurz - 2016-03-13
reply Test results for Broadwell and Skylake - Russell Van Zandt - 2016-02-22
replythread Instruction Throughput on Skylake - Nathan Kurz - 2016-04-23
last replythread Instruction Throughput on Skylake - Agner - 2016-04-24
last replythread Instruction Throughput on Skylake - Nathan Kurz - 2016-04-26
last replythread Instruction Throughput on Skylake - Agner - 2016-04-27
last replythread Instruction Throughput on Skylake - T - 2016-06-18
reply Instruction Throughput on Skylake - Agner - 2016-06-19
last replythread Instruction Throughput on Skylake - Nathan Kurz - 2016-07-08
last replythread Instruction Throughput on Skylake - Nathan Kurz - 2016-07-11
replythread Instruction Throughput on Skylake - Tacit Murky - 2016-07-17
last replythread Haswell register renaming / unfused limits - Peter Cordes - 2017-05-11
reply Haswell register renaming / unfused limits - Tacit Murky - 2017-05-11
last reply Haswell register renaming / unfused limits - Peter Cordes - 2017-05-12
last reply Instruction Throughput on Skylake - T - 2016-08-08
reply Unlamination of micro-fused ops in SKL and earlier - Travis - 2016-09-09
replythread 32B store-forwarding is slower than 16B - Peter Cordes - 2017-05-11
last replythread 32B store-forwarding is slower than 16B - Fabian Giesen - 2017-06-28
last reply 32B store-forwarding is slower than 16B - Agner - 2017-06-28
reply SHL/SHR r,cl latency is lower than throughput - Peter Cordes - 2017-05-27
replythread Test results for Broadwell and Skylake - Bulat Ziganshin - 2017-05-30
last replythread Test results for Broadwell and Skylake - Agner - 2017-05-30
last replythread Test results for Broadwell and Skylake - Bulat Ziganshin - 2017-05-30
last replythread Test results for Broadwell and Skylake - - - 2017-06-19
replythread Test results for Broadwell and Skylake - Jorcy Neto - 2017-06-20
last reply Test results for Broadwell and Skylake - Jorcy Neto - 2017-06-20
replythread Test results for Broadwell and Skylake - Bulat Ziganshin - 2017-06-21
reply Test results for Broadwell and Skylake - Jorcy Neto - 2017-06-26
last replythread Test results for Broadwell and Skylake - - - 2017-07-05
last replythread Test results for Broadwell and Skylake - - - 2017-07-12
last reply Test results for Broadwell and Skylake - Jorcy Neto - 2017-07-19
last replythread Test results for Broadwell and Skylake - Xing Liu - 2017-06-28
last replythread Test results for Broadwell and Skylake - Travis - 2017-06-29
last replythread Test results for Broadwell and Skylake - Xing Liu - 2017-06-30
last reply Test results for Broadwell and Skylake - Travis - 2017-07-13
last reply Official information about uOps and latency SNB+ - SEt - 2017-07-17
 
Test results for Broadwell and Skylake
Author: Agner Date: 2015-12-26 08:27
The optimization manuals at www.agner.org/optimize/#manuals have been updated. I have now tested the Intel Broadwell and Skylake processors. I have not tested the AMD Excavator and Puma because I cannot find suitable motherboards for testing them.

The test results show that the pipeline and execution units in Broadwell are very similar to those of its predecessor Haswell, while the Skylake has been reorganized a little.

The Skylake has a somewhat improved cache throughput and supports the new DDR4 RAM. This is important since RAM access is the bottleneck in many applications. On the other hand, the Skylake has reduced the level-2 cache associativity from 8 to 4.

Floating point division has been improved a little in Broadwell and integer division has been improved a little in Skylake. Gather instructions, which are used for collecting non-contiguous data from memory and joining them into a vector register, are improved somewhat in Broadwell, and a little more in Skylake. This makes it more efficient to collect data into vector registers.

Ever since the first Intel processor with out-of-order execution was released in 1995, there has been a limitation that no micro-operation could have more than two input dependencies. This meant that instructions with more than two input dependencies were split into two or more micro-operations. The introduction of fused multiply-and-add (FMA) instructions in Haswell made it necessary to overcome this limitation. Thus, the FMA instructions were the first instructions to be implemented with micro-operations with three input dependencies in an Intel processor. Once this limitation has been broken, the new capability can also be applied to other instructions. The Broadwell has extended the capability for three-input micro-operations to add-with-carry, subtract-with-borrow and conditional move instructions. The Skylake has extended it further to a blend instruction. AMD processors have never had this limitation of two input dependencies. Perhaps this is the reason why AMD came before Intel with FMA instructions.

The Haswell and Broadwell have two execution units for floating point multiplication and FMA, but only one for addition. This is odd since most floating point code has more additions than multiplications. To get the maximum floating point throughput on these processors, one might have to replace some additions with FMA instructions with a multiplier of 1. Fortunately, the Skylake has fixed this imbalance: it has two floating point arithmetic units, both of which can handle addition, multiplication and FMA. This gives a maximum throughput of two floating point vector operations per clock cycle.

The Skylake has increased the number of execution units for integer vector arithmetic from two to three. In general, the Skylake now has multiple execution units for almost all common operations (except memory write and data permutations). This means that an instruction or micro-operation rarely has to wait for a vacant execution unit. A throughput of four instructions per clock cycle is now a realistic goal for CPU-intensive code, unless the software contains long dependency chains. All arithmetic and logic units support vectors of up to 256 bits. The anticipated support for 512-bit vectors with the AVX-512 instruction set has been postponed to 2016 or 2017.

Intel's design has traditionally tried to standardize operation latencies, i.e. the number of clock cycles that a micro-operation takes. Operations with the same latency were organized under the same execution port in order to avoid a clash when operations that start at different times would otherwise finish at the same time and need the result bus simultaneously. The Skylake microarchitecture has been improved to allow operations with several different latencies under the same execution port. There is still some standardization of latencies left, though. All floating point additions, multiplications and FMA operations have a latency of 4 clock cycles on Skylake, where previous processors had 3 for addition and 5 for multiplication and FMA.

Store forwarding is one clock cycle faster on Skylake than on previous processors. Store forwarding is the time it takes to read from a memory address immediately after writing to the same address.

Previous Intel processors have different states for code that uses the AVX instruction sets with 256-bit vectors versus legacy code with 128-bit vectors and no VEX prefixes. The Sandy Bridge, Ivy Bridge, Haswell and Broadwell processors all have these states and a serious penalty of 70 clock cycles for state switching when a piece of code accidentally mixes VEX and non-VEX instructions. This annoying state shift and penalty has been eliminated on the Skylake. Apparently, the implementation of 256-bit registers has become more streamlined.

I observed an interesting phenomenon when executing 256-bit vector instructions on the Skylake. There is a warm-up period of approximately 14 µs before it can execute 256-bit vector instructions at full speed. Apparently, the upper 128-bit half of the execution units and data buses is turned off in order to save power when it is not used. As soon as the processor sees a 256-bit instruction it starts to power up the upper half. It can still execute 256-bit instructions during the warm-up period, but it does so by using the lower 128-bit units twice for every 256-bit vector. The result is that the throughput for 256-bit vectors is 4-5 times slower during this warm-up period. If you know in advance that you will need to use 256-bit instructions soon, then you can start the warm-up process by placing a dummy 256-bit instruction at a strategic place in the code. My measurements showed that the upper half of the units is shut down again after 675 µs of inactivity.

This warm-up phenomenon has reportedly been observed in previous processors as well (see agner.org/optimize/blog/read.php?i=378#378), but I have not observed it before in any of the processors that I have tested. Perhaps some high-end versions of Intel processors have this ability to shut down the upper 128-bit lane in order to save power, while other variants of the same processors have no such feature. This is something that needs further investigation.

   
Sustained 64B loads per cycle on Haswell & Sky
Author: Nathan Kurz Date: 2015-12-26 18:03
Hi Agner --

Great to see the updates for Skylake! Thanks for putting all the effort into making these. Your guides are tremendous resources.

You mention in your guides that bank conflicts should no longer be a problem for Haswell or Skylake, and that "There are two identical memory read ports (port 2 and 3) and one write port (port 4). These ports all have the full 256 bits width. This makes it possible to make two memory reads and one memory write per clock cycle, with any register size up to 256 bits.". You also say that cache bank conflicts are not a problem, and that "It is always possible to do two cache reads in the same clock cycle without causing a cache bank conflict."

Do you have code that demonstrates this? Even without writes, I'm currently unable to create code that can sustain 2 256-bit loads per cycle from L1D. I started with code that used a fused-multiply-add, but then realized that I was being slowed down by the loads rather than the math. I'm also seeing timing effects that make me suspect that some sort of bank conflict must be occurring, since some orderings of loads from L1 are consistently faster than others. I've put my current test code up here: https://gist.github.com/nkurz/9a0ed5a9a6e591019b8e

When compiled with "gcc -fno-inline -std=gnu99 -Wall -O3 -g -march=native l1d.c -o l1d", results look like this on Haswell:
Testing with SIZE=4096...
calc_simple(array1, array2, size): 5.01 cycles/input
calc_fma(array1, array2, size): 0.22 cycles/input
calc_fma_reordered(array1, array2, size): 0.20 cycles/input
calc_load_only(array1, array2, size): 0.21 cycles/input [ERROR]
calc_load_only_reordered(array1, array2, size): 0.18 cycles/input [ERROR]

And like this on Skylake:
Testing with SIZE=4096...
calc_simple(array1, array2, size): 4.02 cycles/input
calc_fma(array1, array2, size): 0.20 cycles/input
calc_fma_reordered(array1, array2, size): 0.17 cycles/input
calc_load_only(array1, array2, size): 0.20 cycles/input [ERROR]
calc_load_only_reordered(array1, array2, size): 0.17 cycles/input [ERROR]


calc_simple() shows that the latency of an FMA on Haswell is 5 cycles, while it's only 4 cycles on Skylake. It's a simple approach in that there is no unrolling, so we are latency limited. So far, so good.

calc_fma() shows a straightforward approach of loading 4 YMM vectors of floats, and then multiplying them by another 4 YMM vectors of floats, using 4 separate accumulators. Results are slightly slower on Haswell than on Skylake, presumably because 4-way unrolling is not enough to hide the 5 cycle latency of the FMA on Haswell.

calc_fma_reordered() is the first surprise. This is the same as calc_fma(), but loads the vectors in a different order: +96, +32, +64, +0 instead of the in-order byte offsets of +0, +32, +64, +96. I haven't seen any theory that would explain why there would be a difference in speed for these two orders.

calc_load_only() is the next surprise. I dropped the FMA altogether, and just did the loads. We get a slight speed up on Haswell (agreeing with the FMA latency), but no speed up on Skylake. Since there is nothing in the loop but the loads, if we can execute 2 32B loads per cycle, I would have expected to see .125 cycles per input. The [ERROR] on the line is expected, and is because we are not actually calculating the sum.

calc_load_only_reordered() continues the surprise. Once again, reading the vectors in non-linear order improves the speed considerably. But the speed is still much less than .125 that we would see for the theoretical 2 loads per cycle. Again, [ERROR] is expected because there is no math being done.

Do you have any idea what's happening here? Why would the ordering of the loads matter if all the results are in L1D? Why can't I get to .125 cycles per float? I've inspected the results with 'perf record -F 10000 ./l1d' / 'perf report' on both machines, and the assembly looks like I'd expect. I can make the loop logic slightly better, but this doesn't seem to be the limiting factor. What do I need to do differently to achieve sustained load speeds of 64B per cycle on Haswell and Skylake?

   
Sustained 64B loads per cycle on Haswell & Sky
Author: Agner Date: 2015-12-27 01:48
Nathan Kurz wrote:
reading the vectors in non-linear order improves the speed considerably. But the speed is still much less than .125 that we would see for the theoretical 2 loads per cycle.
It is possible to make two reads and one write in the same clock cycle, but it is not possible to obtain a continuous throughput at this theoretical maximum. You are always limited by cache ways, read/write buffers, faulty prefetching, suboptimal reordering, etc. The write operations may sometimes use port 2 or 3 for address calculation, where the maximum throughput requires that they use port 7. It is quite likely that there are other effects that I am not aware of. The execution times that I have measured for 2 reads and 1 write are fluctuating a lot, and typically 40 - 60 % longer than the theoretical minimum.
   
Sustained 64B loads per cycle on Haswell & Sky
Author: Nathan Kurz Date: 2015-12-27 18:59
Agner wrote:
You are always limited by cache ways, read/write buffers, faulty prefetching, suboptimal reordering, etc.
Yes, although in my example I'm considering the much simpler case where there are two reads but no writes, and all data is already in L1. So although problematic in the real world, these shouldn't be a factor here. In fact, I see the same maximum speed if I read the same 4 vectors over and over rather than striding over all the data. I've refined my example, though, and think I now understand what's happening. The problem isn't a bank conflict, rather it's a slowdown due to unaligned access. I don't think I've seen this discussed before.

Contrary to my previous understanding, alignment makes a big difference on the speed at which vectors are read from L1 to register. If your data is 16B aligned rather than 32B aligned, a sequential read from L1 is no faster with 256-bit YMM reads than it is with 128-bit XMM reads. VMOVAPS and VMOVUPS have the same speed, but you cannot achieve 2 32B loads per cycle if the underlying data is not 32B aligned. If the data is 32B aligned, you still can't quite sustain 64 B/cycle of load with either, but you can get to about 54 B/cycle with both.

I put up new test code here: https://gist.github.com/nkurz/439ca1044e11181c1089

Results at L1 sizes are essentially the same on Haswell and Skylake.

Loading 4096 floats with 64 byte raw alignment
Vector alignment 8:
load_xmm : 19.79 bytes/cycle
load_xmm_nonsequential : 23.41 bytes/cycle
load_ymm : 28.64 bytes/cycle
load_ymm_nonsequential : 36.57 bytes/cycle

Vector alignment 16:
load_xmm : 29.26 bytes/cycle
load_xmm_nonsequential : 29.05 bytes/cycle
load_ymm : 28.44 bytes/cycle
load_ymm_nonsequential : 36.90 bytes/cycle

Vector alignment 24:
load_xmm : 19.79 bytes/cycle
load_xmm_nonsequential : 23.54 bytes/cycle
load_ymm : 28.64 bytes/cycle
load_ymm_nonsequential : 36.57 bytes/cycle

Vector alignment 32:
load_xmm : 29.05 bytes/cycle
load_xmm_nonsequential : 28.85 bytes/cycle
load_ymm : 53.19 bytes/cycle
load_ymm_nonsequential : 52.51 bytes/cycle

What this says is that unless your loads are 32B aligned, regardless
of method you are limited to about 40B loaded per cycle. If you are
sequentially loading non-32B aligned data from L1, the speeds for 16B
loads and 32B loads are identical, and limited to less than 32B per
cycle. All alignments not shown were the same as 8B alignment.

Loading in a non-sequential order is about 20% faster for unaligned
XMM and unaligned YMM loads. It's possible there is a faster order
than I have found so far. Aligned loads are the same speed
regardless of order. Maximum speed for aligned XMM loads is about 30
B/cycle, and maximum speed for aligned YMM loads is about 54 B/cycle.

At L2 sizes, the effect still exists, but is less extreme. XMM loads
are limited to 13-15 B/cycle on both Haswell and Skylake. On Haswell,
YMM non-aligned loads are 18-20 B/cycle, and YMM aligned loads are
24-26 B/cycle. On Skylake, YMM aligned loads are slightly faster at
27 B/cycle. Interestingly, sequential unaligned L2 loads on Skylake
are almost the same as aligned loads (26 B/cycle), while non-sequential
loads are much slower (17 B/cycle).

At L3 sizes, alignment is barely a factor. On Haswell, all loads are
limited to 11-13 B/cycle. On Skylake, XMM loads are the same 11-13
B/cycle, while YMM loads are slightly faster at 14-17 B/cycle.

Coming from memory, XMM and YMM loads on Haswell are the same
regardless of alignment, at about 5 B/cycle. On Skylake, XMM loads
are about 6.25 B/cycle, and YMM loads are about 6.75 B/cycle, with
little dependence on alignment. It's possible that prefetch can
improve these speeds slightly.

Agner wrote:
The write operations may sometimes use port 2 or 3 for address calculation, where the maximum throughput requires that they use port 7.
I don't recall if you mention it in your manuals, but I presume you are aware that Port 7 on Haswell and Skylake is only capable of "simple" address calculations? Thus sustaining 2 loads and a store is only possible if the store address is [const + base] form rather than [const + index*scale + base]. And as you point out, even if you do this, it can still be difficult to force the processor to use only Port 7 for the store address.
   
Sustained 64B loads per cycle on Haswell & Sky
Author: John D. McCalpin Date: 2016-01-04 07:21
Thanks to Nathan Kurz for the interesting test code.

I was able to reproduce the results on a Xeon E5-2660 v3 system once I pinned the core frequency to match the nominal frequency (2.5 GHz on that system).

It looks like the results are actually a bit better than reported because the tests are short enough that the timer overhead is not negligible. I modified the code to print out the "cycle_diff" variable in each case and see that the fastest tests are only about 312 cycles. RDTSCP overhead on this system is 32 cycles (for my very similar inline assembly), which suggests that the loop is only taking about 280 cycles. This raises the estimate of the throughput from 52.5 Bytes/cycle to 52.5*312/280 = 58.5 Bytes/cycle. This is 91.4% of peak, which is almost as fast as the best results I have been able to obtain with a DDOT kernel.

For my DDOT measurements, I ran a variety of problem sizes and did a least-squares fit to estimate the slope and intercept of the cycle count as a function of problem size. This gave estimated slopes corresponding to up to ~95% of 64 Bytes/cycle. (I used this approach because I was reading not only the TSC, but up to 8 PMCs as well, and the total overhead became quite large -- well over 200 cycles.)

In my experience, it is exceedingly difficult to understand performance limiters once you have reached this level of performance -- even if you are on the hardware engineering team! As a rule of thumb, anything exceeding 8/9 (88.9%) of the simple theoretical peak is pretty close to asymptotic, and exceeding 16/17 (94.1%) of peak is extremely uncommon.

   
Sustained 64B loads per cycle on Haswell & Sky
Author: T Date: 2016-06-18 20:32
The aligned vs unaligned results make intuitive sense. In recent processors, the penalty for unaligned access has been progressively reduced: the penalty went to zero on Sandy Bridge (and perhaps earlier), at least for loads that didn't cross a 64B cache-line boundary. In Haswell, even the 64B latency penalty disappeared - although only for loads, not stores. You can see this all graphically here:

blog.stuffedcow.net/2014/01/x86-memory-disambiguation/

The 2D charts are trying to get at the penalty of store-to-load forwarding, but the cells off of the main diagonal do a great job of showing the unaligned load/store penalties as well.

So you are finding that unaligned loads *still* have a penalty, even on Skylake - right? The key is loads that cross a 64B boundary. Fundamentally that requires bringing in two different lines from the L1, and merging the results so you get a word composed of some of one line and some of another. The improvements culminating in Haswell reduced the latency of this operation to the point where it fits inside the standard 4 cycle latency for ideal L1 access, but it can't avoid the double bandwidth usage of the unaligned loads. In many algorithms, the maximum bandwidth of the L1 isn't approached (i.e., the loads-per-cycle are 1 or less), so unaligned access ends up the same as aligned. In your loop, however, you do saturate the load bandwidth, so loads that cross a 64B boundary will cut your throughput in half, or worse.

It doesn't explain the results you got by inverting the load order, but perhaps some of that can be explained by how the loads "pair up". That is, two aligned loads can pair up in the same cycle since each only needs 1 of the 2 "load paths" from L1. An unaligned load needs both, however. So if you have a load pattern like AAUUAAUU (where A is an aligned load and U is unaligned) you get:

cycle loads
0 AA
1 U
2 U
3 AA
4 U
5 U
...

So you get 4 loads every 3 cycles, because the aligned loads are always able to pair.

On the other hand, if you have a load pattern like AUAUAUAUA, you get the following:

cycle loads
0 A
1 U
2 A
3 U
....

I.e., only 3 loads every 3 cycles, or a 25% penalty to throughput, because the aligned loads end up being singletons as well. You might ask why OoO wouldn't solve this - well OoO is based on the scheduler which understands instruction dependencies, and has a few other special-case tricks to re-order things (e.g., to avoid port retirement conflicts), but otherwise still does stuff in-order. So it likely can't understand that it should try to reorder the loads to pair aligned loads. Furthermore the memory model imposes restrictions on reordering loads (but I don't fully grok how this actually falls out in practice when you consider load buffers and the coherency protocol and so on).

All that to say that reordering the loads might easily swap the behavior from an AAUU behavior to an AUAU one.

   
Sustained 64B loads per cycle on Haswell & Sky
Author: Jens Nurmann Date: 2017-01-12 02:40
Nathan Kurz wrote:

...

The write operations may sometimes use port 2 or 3 for address calculation, where the maximum throughput requires that they use port 7.
I don't recall if you mention it in your manuals, but I presume you are aware that Port 7 on Haswell and Skylake is only capable of "simple" address calculations? Thus sustaining 2 loads and a store is only possible if the store address is [const + base] form rather than [const + index*scale + base]. And as you point out, even if you do this, it can still be difficult to force the processor to use only Port 7 for the store address.
I know I am really late in response to this but I think that Skylake can be "hinted" somewhat on the use of port 7 - at least for GPR based code. Consider the following (which is a core loop for a long addition routine)

.Loop:

mov Limb0, [Op1] ;1 1 p23 2 0.5
adc Limb0, [Op2] ;2 2 p06 p23 1
mov [Op3], Limb0 ;1 2 p237 p4 3 1
mov Limb1, [Op1+8] ;1 1 p23 2 0.5
adc Limb1, [Op2+8] ;2 2 p06 p23 1
mov [Op3+8], Limb1 ;1 2 p237 p4 3 1
mov Limb2, [Op1+16] ;1 1 p23 2 0.5
adc Limb2, [Op2+16] ;2 2 p06 p23 1
mov [Op3+16], Limb2 ;1 2 p237 p4 3 1
mov Limb3, [Op1+24] ;1 1 p23 2 0.5
adc Limb3, [Op2+24] ;2 2 p06 p23 1
mov [Op3+24], Limb3 ;1 2 p237 p4 3 1

mov Limb0, [Op1+32] ;1 1 p23 2 0.5
adc Limb0, [Op2+32] ;2 2 p06 p23 1
mov [Op3+32], Limb0 ;1 2 p237 p4 3 1
mov Limb1, [Op1+40] ;1 1 p23 2 0.5
adc Limb1, [Op2+40] ;2 2 p06 p23 1
mov [Op3+40], Limb1 ;1 2 p237 p4 3 1
mov Limb2, [Op1+48] ;1 1 p23 2 0.5
adc Limb2, [Op2+48] ;2 2 p06 p23 1
mov [Op3+48], Limb2 ;1 2 p237 p4 3 1
mov Limb3, [Op1+56] ;1 1 p23 2 0.5
adc Limb3, [Op2+56] ;2 2 p06 p23 1
mov [Op3+56], Limb3 ;1 2 p237 p4 3 1

lea Op1, [Op1+64] ;1 1 p15 1 0.5
lea Op2, [Op2+64] ;1 1 p15 1 0.5
lea Op3, [Op3+64] ;1 1 p15 1 0.5

.Check:

dec Size1
jne .Loop

On my Skylake system it executes in 817 cycles for Size1=683 (measured with RDTSCP). If I insert a "vpblend YMM0, YMM0, YMM0, 0" after "mov [Op3], Limb0" the execution time goes down to 698 cycles repeatedly! This seems to imply that port 7 is always correctly used for the write. So far I haven't tried whether a similar scheme - inserting a carefully chosen GPR opcode inside an AVX2 loop - yields similar results.

   
Test results for Broadwell and Skylake
Author: Peter Cordes Date: 2015-12-28 06:19
Thanks for your excellent work on the instruction tables and microarchitecture guide.

Agner wrote:

This annoying state shift and penalty has been eliminated on the Skylake. Apparently, the implementation of 256-bit registers has become more streamlined.

I wonder if the performance penalty has been replaced with a power-consumption penalty. Perhaps there's still a "state C" where Skylake uses more power? The performance penalty on the earlier CPUs ensures most software will still avoid this. I don't think this is very likely; probably they came up with some clever way to avoid penalties except maybe when forwarding results from a non-VEX op to a 256b op (over the bypass network).

Do 128b non-VEX ops have a "false" dependency on the upper128 of a register? Is there a latency penalty when a 256b insn reads a ymm register last written by a non-VEX insn (or an extra uop to merge the xmm into the ymm)?

More importantly, is VZEROUPPER helpful in any way on Skylake? (Obviously this is a bad idea for binaries that might be run on older CPUs).

There is one use-case for mixing VEX and non-VEX: PBLENDVB x,x,xmm0 is 1 uop, p015. VPBLENDVB v,v,v,v is 2 uops, 2p015, and 2c latency. I'm picturing a function that needs to do a lot of blends, but can also benefit from using 3-operand non-destructive VEX insns, except for non-VEX PBLENDVB.

Also: I remember reading something in a realworldtech forum thread about wider uop fetch in Skylake. (The forum isn't searchable, so I prob. can't find it now). Is there any improvement in the frontend for loops that don't fit in the loop buffer? I was hoping Skylake would fetch whole uop cache lines (up to 6 uops) per clock, and put them into a small buffer to more consistently issue 4 fused-domain uops per clock.

I've considered trying to align / re-ordering insns for uop-cache throughput in a loop that didn't quite fit in the loop buffer. I saw performance differences (on SnB) from reordering, but I never went beyond trial and error. I don't have an editor that shows the assembled binary updated on the fly as source edits are made, let alone with 32B boundaries marked and uops grouped into cache lines, so it would have been very time consuming.

   
Test results for Broadwell and Skylake
Author: Agner Date: 2015-12-29 01:36
Peter Cordes wrote:
Perhaps there's still a "state C" where Skylake uses more power?
I find no evidence of states, and I don't think it requires more power. The 128/256-bit vectors are probably treated somewhat like 8/16/32/64 bit general purpose registers.
Do 128b non-VEX ops have a "false" dependency on the upper128 of a register? Is there a latency penalty when a 256b insn reads a ymm register last written by a non-VEX insn (or an extra uop to merge the xmm into the ymm)?
There is false dependency and 1 clock extra latency, but no extra µop seen in the counters. I see no difference in the clock counts here whether the 128-bit instruction has VEX prefix or not.
   
Test results for Broadwell and Skylake
Author: Tacit Murky Date: 2016-01-04 15:04
Hello, Agner. Thanks for the detailed work, but there is some strangeness in the results, which looks like mistakes. Here are 2 examples:
For Haswell — «MOVBE r64,m64» is a 3-mop instruction with TP of 0.5 CPI (2 IPC), which is impossible with 4 IPC total pipeline restriction. AIDA64 readout (see instlatx64.atw.hu ) shows 1 IPC here.
For Skylake — «PMUL* (v,)v,v» is a 1-mop instruction with only 1 IPC, despite 2 ports available for execution (p01). AIDA64 shows TP of 2 IPC (0.5 CPI) because of second integer multiplier.
There are more minor mistakes elsewhere.
   
Test results for Broadwell and Skylake
Author: Agner Date: 2016-01-05 13:16
You are right.
The throughput for MOVBE r64,m64 is 4 instructions per 3 clock cycles.
The throughput for integer vector multiplication instructions and several other integer vector instructions is 2 instructions per clock for 128-bit and 256-bit registers, but 1 instruction per clock for 64-bit registers, because port 0 supports these instructions for all vector sizes, while port 1 supports the same instructions only for 128-bit and 256-bit vectors.
   
Test results for Broadwell and Skylake
Author: Tacit Murky Date: 2016-03-09 20:58
More stuff. Have you measured total T-put of immediate data? AIDA64 readout is inconsistent and may be erroneous. Things to consider:
1) Legacy decoder should have different T-put than µop-cache; IDQ queue may or may not impose its own restrictions.
2) As it is known for SB and IB (but may not be true for Haswell and newer CPUs; would be cool to test all of them), µop-cache slot has 4 bytes of data for both imm and ofs fields; so if (there is an 8-byte const) or (total length of imm and ofs consts is >4 bytes) — 2 entries are allocated for that µop. Literal pool in scheduler may have its own restrictions in port number (3…6) and width (4 or 8 bytes).
3) Instructions of interest:
—MOV r32/64,imm32/64 : 4/8 bytes of literals per instruction with 4 IPC of max. T-put (ideally should be 16/32 bytes/cl.);
—ADD r32,imm32 : 4 bytes of literals per instruction with 4 IPC of max. T-put;
—BLENDPS/PD xmm,[r+ofs32],imm8 : 5 bytes of total literals per instruction with 3 IPC of max. T-put, but only 2 L1D reads/cl.; one may substitute the 3rd blend with MOVAPS [r+ofs32],xmm, giving 5+5+4=14 bytes of literals for 3 IPC (but 5 µops).
   
Test results for Broadwell and Skylake
Author: Tacit Murky Date: 2016-06-05 15:26
Intel's Optimisation Manual says certain things about Skylake's OoO-machine updates:
1. «Legacy decode pipeline» can deliver 5 µops/cl to IDQ, 1 more than before;
2. DSB can deliver 6 µops/cl to IDQ, 2 more than before;
3. There are 2 IDQ's (1 per thread) 64 µops each; all 64 can be used for a loop (in both threads);
4. Improved SMT performance with HT on, via the longer-latency PAUSE instruction and/or a wider retire stage.

All of this contradicts your results for Skylake. Or was that info related only to Broadwell?

   
Minor bug in the microarchitecture manual
Author: SHK Date: 2016-01-10 13:05
Hi Agner, thanks a lot for your manuals, they're an invaluable source, even better than the official ones.

I've noticed a small error in microarchitecture.pdf. On page 148 (description of Skylake's pipeline), you say that "The sizes of the reorder buffer, reservation station and register file have allegedly been increased, but the details have not been published".
Their sizes have been published (224 slots for the ROB, 97 RS entries, 180 PREGS, and so on); you can view them on page 12 of this presentation from IDF15 (it's the SPCS001 session):

https://hubb.blob.core.windows.net/e5888822-986f-45f5-b1d7-08f96e618a7b-published/73ed87d8-209a-4ca1-b456-42a167ffd0bd/SPCS001%20-%20SF15_SPCS001_103f.pdf?sv=2014-02-14&sr=c&sig=XKetbBtWcJzdBjJEc1bFubMzOrEPpoVcK6%2Bm693ZUts%3D&se=2016-01-11T18%3A50%3A10Z&sp=rwd

Thanks again and keep up with the good work!

   
Minor bug in the microarchitecture manual
Author: Agner Date: 2016-01-16 03:26
Thanks for the tip. The link doesn't work. I found it here: myeventagenda.com/sessions/0B9F4191-1C29-408A-8B61-65D7520025A8/7/5 session SPCS001.
   
Test results for Broadwell and Skylake
Author: John D. McCalpin Date: 2016-01-12 13:54
I just ran across some performance counter bugs on Haswell that may influence one's interpretation of instruction retirement rates and may bias measurements of uops per instruction.

I put performance counters around 100 (outer) iterations of a simple 10-instruction loop that executed 1000 times. According to Agner's instruction tables this loop should have 12 uops. Both the fixed-function "instructions retired" and the programmable "INST_RETIRED.ANY_P" events report 12 instructions per loop iteration (not 10), while the UOPS_RETIRED.ALL programmable counter event reported 14 uops per loop iteration (not 12). While I could be misinterpreting the uop counts, there is no way that I could have mis-counted the instructions --- it took all of my fingers, but did not generate an overflow condition. ;-)

It turns out that there are a number of errata for both the instructions retired events and the uops retired event on all Intel Haswell processors. Somewhat perversely, the different Haswell products have different errata listed, even though they have the same DISPLAYFAMILY_DISPLAYMODEL designation, but all of them that I checked (Xeon E5 v3 (HSE71 in doc 330785), Xeon E3 v3 (HSW141 in doc 328908), and 4th Generation Core Desktop (HSD140 in doc 328899)) include an erratum to the effect that the "instructions retired" counts may overcount or undercount. This erratum is also listed for the 5th Generation Core (Broadwell) processors (BDM61 in doc 330836), but is not listed in the "specification update" document for the Skylake processors (doc 332689).

For this particular loop the counts are completely stable with respect to variations in loop length (e.g., from 500 to 11000 shows no effect other than asymptotically decreasing overhead). The machine is running with HyperThreading enabled, but there are no other users or non-OS tasks and this job was pinned to (local) core 4 on socket 1, so there is no way that interference with another thread (mentioned in several other errata) could account for seeing identical behavior over several hundred trials.

Reading between the lines, the language that Intel uses in the descriptions of these performance counter errata seems consistent with the language used in other cases for which the errors are not "large" (not approaching 100%), but are also not "small" (not limited to single-digit percentages). It is very hard to decide whether I want to take the time to try to characterize or bound this particular performance counter error. It may end up having an easy story, or it may end up being completely inexplicable without inspection of the processor RTL.

   
Test results for Broadwell and Skylake
Author: Jess Date: 2016-02-11 11:00
I notice that SKD044 on page 28 of this PDF:

www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf

explains why the discrepancy occurs and how large it is likely to be for this chip. Similar errata for other chips seem to be less detailed, though I haven't checked exhaustively.

   
Description of discrepancy
Author: Nathan Kurz Date: 2016-03-13 17:54
Jess wrote:
I notice that SKD044 on page 28 of this PDF:

www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf

explains why the discrepancy occurs and how large it is likely to be for this chip.

I appreciate the link, but I'm unable to find the portion that you refer to. Could you point more exactly to the details you found?

SKD044 doesn't exist in that document, SKL044 is about WRMSR, and nothing on page 28 seems relevant. I did find SKD044 in a different document (http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/6th-gen-core-u-y-spec-update.pdf) but still about WRMSR. The closest erratum I did find was SKL048 "Processor May Run Intel AVX Code Much Slower than Expected", but this is only when coming out of C6, and doesn't give other details.

   
Test results for Broadwell and Skylake
Author: Russell Van Zandt Date: 2016-02-22 17:50
Thank you all for the useful information. FYI, the latest Intel architecture optimization manual discusses the Skylake changes for the mixed AVX / SSE problem in great detail, including diagrams and tables. This is in section 11.3 "Mixing AVX Code with SSE Code" in the January 2016 edition. Skylake has not eliminated the problem entirely, with "partial register dependency + blend" as the penalty in one mode, and ~XSAVE in another mode. Use of VZEROUPPER is still recommended, in rule 72. "The Skylake microarchitecture implements a different state machine than prior generations to manage the YMM state transition associated with mixing SSE and AVX instructions. It no longer saves the entire upper YMM state transition ... but saves the upper bits of individual register. As a result ... will experience a penalty associated with partial register dependency...".

Other topics discussed include "Align data to 32 bytes" (Section 11.6.1), which was recently discussed on this blog too.

There is lots and lots of Skylake material, including the tradeoffs between electrical power reduction vs. performance. Like "The latency of the PAUSE instruction in prior generation microarchitecture is about 10 cycles, whereas on Skylake microarchitecture it has been extended to as many as 140 cycles... There's also a small power benefit in 2-core and 4-core systems... As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss." Section 8.4.7

   
Instruction Throughput on Skylake
Author: Nathan Kurz Date: 2016-04-23 13:16
In the Section 11 "Skylake" of your Microarchitecture Guide (http://www.agner.org/optimize/microarchitecture.pdf), you say: "There are four decoders, which can handle instructions generating up to four μops per clock cycle in the way described on page 121 for Sandy Bridge" and "Code that runs out of the μop cache are not subject to the limitations of the fetch and decode units. It can deliver a throughput of 4 (possibly fused) μops or the equivalent of 32 bytes of code per clock cycle."

This seems contradicted by Section 2.1 "Skylake Microarchitecture" of the Intel Optimization manual (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf): "Legacy Decode Pipeline delivery of 5 uops per cycle to the IDQ compared to 4 uops in previous generations" and "The DSB delivers 6 uops per cycle to the IDQ compared to 4 uops in previous generations." These numbers also match Figure 2.1 in that guide, which makes me think the Intel manual is probably correct here.

About Skylake, you also say "It is designed for a throughput of four instructions per clock cycle." I've recently measured a few results that make me wonder if it's actually capable of more than that. Did you happen to do any tests that would confirm whether Skylake might be able to sustain 5 or 6 unfused instructions per cycle (thus possibly 7 or 8 including fused branches not taken) if the correct execution ports are available? From the published specs, I haven't been able to find evidence of a hard limit of 4 unfused instructions per cycle.

One stage for which I haven't been able to find documentation of the Skylake limits is retirement. Section 2.6.5 on Hyperthreading Retirement says "If one logical processor is not ready to retire any instructions, then all retirement bandwidth is dedicated to the other logical processor." I've seen claims that Skylake has "wider Hyperthreading retirement" than previous generations, and there is also a documented performance monitor event for "Cycles with less than 10 actually retired uops", which would imply that the maximum is at least 10. Do you know if this is true?

   
Instruction Throughput on Skylake
Author: Agner Date: 2016-04-24 00:02
Nathan Kurz wrote:
Did you happen to do any tests that would confirm whether Skylake might be able to sustain 5 or 6 unfused instructions per cycle (thus possibly 7 or 8 including fused branches not taken) if the correct execution ports are available?
NOPs have a throughput of 4 per clock cycle, and NOPs are not using any execution unit. I have never seen a higher throughput than 4 if you count a fused jump as one instruction. If two threads are running in the same core then each thread gets 2 NOPs per clock.

It is possible that the decoders have a higher throughput, but then there must be a bottleneck somewhere else. This will be hard to verify.

   
Instruction Throughput on Skylake
Author: Nathan Kurz Date: 2016-04-26 13:50
Agner wrote:
It is possible that the decoders have a higher throughput, but then there must be a bottleneck somewhere else. This will be hard to verify.
I'm starting to understand this better. Using Likwid and defining some custom events, I've determined that Skylake can sustain execution and retirement of 5 or 6 µops per cycle. This is ignoring jump/cc "macro-fusion", which would presumably boost us up to 7 or 8. The bottleneck appears to be the "renamer", which can only "issue" 4 µops per cycle.
The question is "What constitutes a µop for this stage?"

In 2.3.3.1 of the Intel Optimization Guide, when discussing Sandy Bridge it says: "The Renamer is the bridge between the in-order part in Figure 2-5, and the dataflow world of the Scheduler. It moves up to four micro-ops every cycle from the micro-op queue to the out-of-order engine. Although the renamer can send up to 4 micro-ops (unfused, micro-fused, or macro-fused) per cycle, this is equivalent to the issue port can dispatch six micro-ops per cycle."

The grammar is atrocious, but I think it means that while the Renamer can only move 4 µops, these can be micro-fused µops that will be "unlaminated" to a load µop and an action µop. From what I can tell, Skylake can move 6 fused µops per cycle from the DSB to the IDQ, but can only "issue" 4 fused µops per cycle from the IDQ. But since the scheduler only handles unfused µops, this means that we can "dispatch" up to twice that many depending on fusion.

The result of this is that while it is probably true to say that Skylake is "designed for a throughput of four instructions per clock cycle", instructions per clock cycle can be a poor metric to use when comparing fused and unfused instructions. Previously, I'd naively thought that once the instructions were decoded to the DSB, it didn't matter whether one expressed LOAD-OP as a single instruction, or as a separate LOAD then OP.

But if one is being constrained by the Renamer, it turns out that it can make a big difference in total execution time. For example, I'm finding that in a tight loop, this (two combined load-adds):

#define ASM_ADD_ADD_INDEX(in, sum1, sum2, index) \
__asm volatile ("add 0x0(%[IN], %[INDEX]), %[SUM1]\n" \
"add 0x8(%[IN], %[INDEX]), %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index))


Is about 20% faster than this (two separate loads and adds):

#define ASM_LOAD_LOAD_INDEX(in, sum1, sum2, index, tmp) \
__asm volatile ("mov 0x0(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM1]\n" \
"mov 0x8(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index), \
[TMP] "=r" (tmp))

While the hybrid (one and one) is the same speed as the fast version:

#define ASM_LOAD_ADD_INDEX(in, sum1, sum2, index, tmp) \
__asm volatile ("mov 0x0(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM1]\n" \
"add 0x8(%[IN], %[INDEX]), %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index), \
[TMP] "=r" (tmp))


What I don't understand yet is why all variations that directly increment %[IN] are almost twice as slow as the versions that use and increment %[INDEX]:

#define ASM_ADD_ADD_DIRECT(in, sum1, sum2) \
__asm volatile ("add 0x0(%[IN]), %[SUM1]\n" \
"add 0x8(%[IN]), %[SUM2]\n" \
"add $0x10, %[IN]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2))

I also don't understand yet why I get 30% faster speeds for loops small enough to fit in the LSD than when unrolled such that the number of µops requires the DSB. Apparently the Loop Stream Detector still plays a performance role in some cases.

   
Instruction Throughput on Skylake
Author: Agner Date: 2016-04-27 01:14
Nathan Kurz wrote:
The bottleneck appears to be the "renamer", which can only "issue" 4 µops per cycle.
I think the decoding front end and the renamer are designed with a 4-wide pipeline for a throughput of four µops per clock. These µops are queuing up in the reservation station if execution of them is delayed for any reason. The scheduler can issue more than 4 µops per clock cycle in bursts until the queue is empty.

I also don't understand yet why I get 30% faster speeds for loops small enough to fit in the LSD than when unrolled such that the number of µops requires the DSB. Apparently the Loop Stream Detector still plays a performance role in some cases.
Instruction fetch and decode is often a bottleneck - you need to check the instruction lengths. Alignment of the loop entry can also influence the results. Finally, you will often see cache effects influencing the results in a less than obvious way.
   
Instruction Throughput on Skylake
Author: T Date: 2016-06-18 19:27
When you say:

> I think the decoding front end and the renamer are designed with a 4-wide pipeline for a throughput of four µops per clock.

Are you talking fused-domain or unfused-domain µops? Here I'm only interested in micro-fusion. Let's assume there are no opportunities for macro-fusion. If that's 4-wide in the fused domain, it implies that the processor could sustain 6 µops throughput in the unfused domain, if there are no 4 (or 5) wide bottlenecks downstream of the scheduler (e.g., issue or retirement). That would be a big deal, since it implies that read-modify instructions may be highly preferred in many scenarios over two separate load and reg-reg op instructions.

Hi Nathan,

Are you able to share your results about 5 or 6 wide throughput? You hinted at them in your post, but anything reproducible would be great.

T

   
Instruction Throughput on Skylake
Author: Agner Date: 2016-06-19 00:59
T wrote:
If that's 4-wide in the fused domain, it implies that the processor could sustain 6 µops throughput in the unfused domain, if there are no 4 (or 5) wide bottlenecks downstream of the scheduler (e.g., issue or retirement).
Yes, it can do 6 µops in the unfused domain.
   
Instruction Throughput on Skylake
Author: Nathan Kurz Date: 2016-07-08 02:50
T wrote:

Are you able to share your results about 5 or 6 wide throughput? You hinted at them in your post, but anything reproducible would be great.

All sharable, but I haven't been thinking about this direction for a couple months. I'll try to post something here if I can dig it up, but I won't be able to get to it immediately.

But if my recollection is correct, the short answer is that yes, Read-Modify instructions should almost always be used as heavily as possible for inner loops on modern Intel processors. They have significant upside if you would otherwise be limited by the renamer.

And while you say you are not interested in it, the corollary for micro-fusion is that CMP-JCC instructions should almost always be adjacent in assembly. I'm pretty sure that both GCC and LLVM would benefit from putting a higher penalty on the split.

   
Instruction Throughput on Skylake
Author: Nathan Kurz Date: 2016-07-11 22:21
OK, here's my cleaned up test code.

// gcc -g -Wall -O2 fusion.c -o fusion -DLIKWID -llikwid [may also need -lm -lpthread]
// likwid-perfctr -m -g UOPS_ISSUED_ANY:PMC0,UOPS_EXECUTED_CORE:PMC1,UOPS_RETIRED_ALL:PMC2,BR_INST_RETIRED_NEAR_TAKEN:PMC3 -C 1 fusion

#include <x86intrin.h>
#include <stdint.h>
#include <stdio.h>

#ifdef LIKWID
#include <likwid.h>
#define MEASURE_INIT() \
  do { \
    likwid_markerInit(); \
    likwid_markerThreadInit(); \
  } while (0)
#define MEASURE_FINI() \
  do { \
    likwid_markerClose(); \
  } while (0)
#define MEASURE(name, code) \
  do { \
    sum1 = sum2 = 0; \
    likwid_markerStartRegion(name); \
    code; \
    likwid_markerStopRegion(name); \
    printf("%s: sum1=%ld, sum2=%ld\n", name, sum1, sum2); \
  } while (0)
#else // not LIKWID
#define MEASURE_INIT()
#define MEASURE_FINI()
#define MEASURE(name, code) \
  do { \
    sum1 = sum2 = 0; \
    code; \
    printf("%s: sum1=%ld, sum2=%ld\n", name, sum1, sum2); \
  } while (0)
#endif // not LIKWID

#define ASM_TWO_MICRO_TWO_MACRO(in1, sum1, in2, sum2, max) \
__asm volatile ("1:\n" \
                "add (%[IN1]), %[SUM1]\n" \
                "cmp %[MAX], %[SUM1]\n" \
                "jae 2f\n" \
                "add (%[IN2]), %[SUM2]\n" \
                "cmp %[MAX], %[SUM2]\n" \
                "jb 1b\n" \
                "2:" : \
                [SUM1] "+&r" (sum1), \
                [SUM2] "+&r" (sum2) : \
                [IN1] "r" (in1), \
                [IN2] "r" (in2), \
                [MAX] "r" (max))

#define ASM_NO_MICRO_TWO_MACRO(in1, sum1, in2, sum2, max, tmp1, tmp2) \
__asm volatile ("1:\n" \
                "mov (%[IN1]), %[TMP1]\n" \
                "add %[TMP1], %[SUM1]\n" \
                "cmp %[MAX], %[SUM1]\n" \
                "jae 2f\n" \
                "mov (%[IN2]), %[TMP2]\n" \
                "add %[TMP2], %[SUM2]\n" \
                "cmp %[MAX], %[SUM2]\n" \
                "jb 1b\n" \
                "2:" : \
                [TMP1] "=&r" (tmp1), \
                [TMP2] "=&r" (tmp2), \
                [SUM1] "+&r" (sum1), \
                [SUM2] "+&r" (sum2) : \
                [IN1] "r" (in1), \
                [IN2] "r" (in2), \
                [MAX] "r" (max))

#define ASM_ONE_MICRO_TWO_MACRO(in1, sum1, in2, sum2, max, tmp) \
__asm volatile ("1:\n" \
                "add (%[IN1]), %[SUM1]\n" \
                "cmp %[MAX], %[SUM1]\n" \
                "jae 2f\n" \
                "mov (%[IN2]), %[TMP]\n" \
                "add %[TMP], %[SUM2]\n" \
                "cmp %[MAX], %[SUM2]\n" \
                "jb 1b\n" \
                "2:" : \
                [TMP] "=&r" (tmp), \
                [SUM1] "+&r" (sum1), \
                [SUM2] "+&r" (sum2) : \
                [IN1] "r" (in1), \
                [IN2] "r" (in2), \
                [MAX] "r" (max))

#define ASM_ONE_MICRO_ONE_MACRO(in1, sum1, in2, sum2, max, tmp) \
__asm volatile ("1:\n" \
                "add (%[IN1]), %[SUM1]\n" \
                "cmp %[MAX], %[SUM1]\n" \
                "mov (%[IN1]), %[TMP]\n" \
                "jae 2f\n" \
                "add %[TMP], %[SUM2]\n" \
                "cmp %[MAX], %[SUM2]\n" \
                "jb 1b\n" \
                "2:" : \
                [TMP] "=&r" (tmp), \
                [SUM1] "+&r" (sum1), \
                [SUM2] "+&r" (sum2) : \
                [IN1] "r" (in1), \
                [IN2] "r" (in2), \
                [MAX] "r" (max))

// two separate loads and adds, two non-fused cmp then jcc
#define ASM_NO_MICRO_NO_MACRO(in1, sum1, in2, sum2, max, tmp1, tmp2) \
__asm volatile ("mov (%[IN1]), %[TMP1]\n" \
                "1:\n" \
                "add %[TMP1], %[SUM1]\n" \
                "cmp %[MAX], %[SUM1]\n" \
                "mov (%[IN2]), %[TMP2]\n" \
                "jae 2f\n" \
                "add %[TMP2], %[SUM2]\n" \
                "cmp %[MAX], %[SUM2]\n" \
                "mov (%[IN1]), %[TMP1]\n" \
                "jb 1b\n" \
                "2:" : \
                [TMP1] "=&r" (tmp1), \
                [TMP2] "=&r" (tmp2), \
                [SUM1] "+&r" (sum1), \
                [SUM2] "+&r" (sum2) : \
                [IN1] "r" (in1), \
                [IN2] "r" (in2), \
                [MAX] "r" (max))

int main(/* int argc, char **argv */)
{
    uint64_t tmp, tmp1, tmp2;
    uint64_t sum1, sum2;
    uint64_t in1 = 1;
    uint64_t in2 = 1;
    uint64_t max = 10000000;

MEASURE_INIT();

MEASURE("two_micro_two_macro", ASM_TWO_MICRO_TWO_MACRO(&in1, sum1, &in2, sum2, max));

MEASURE("one_micro_two_macro", ASM_ONE_MICRO_TWO_MACRO(&in1, sum1, &in2, sum2, max, tmp));

MEASURE("one_micro_one_macro", ASM_ONE_MICRO_ONE_MACRO(&in1, sum1, &in2, sum2, max, tmp));

MEASURE("no_micro_two_macro", ASM_NO_MICRO_TWO_MACRO(&in1, sum1, &in2, sum2, max, tmp1, tmp2));

MEASURE("no_micro_no_macro", ASM_NO_MICRO_NO_MACRO(&in1, sum1, &in2, sum2, max, tmp1, tmp2));

MEASURE_FINI();

    return 0;
}

And here's what I see on Skylake:

nate@skylake:~/src$ likwid-perfctr -m -g UOPS_ISSUED_ANY:PMC0,UOPS_EXECUTED_CORE:PMC1,UOPS_RETIRED_ALL:PMC2,BR_INST_RETIRED_NEAR_TAKEN:PMC3 -C 1 fusion
CPU name:	Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPU type:	Intel Skylake processor
CPU clock:	3.41 GHz
--------------------------------------------------------------------------------
two_micro_two_macro: sum1=10000000, sum2=9999999
one_micro_two_macro: sum1=10000000, sum2=9999999
one_micro_one_macro: sum1=10000000, sum2=9999999
no_micro_two_macro: sum1=10000000, sum2=9999999
no_micro_no_macro: sum1=10000000, sum2=9999999
--------------------------------------------------------------------------------
================================================================================
Group 1 Custom: Region two_micro_two_macro
================================================================================
|       UOPS_ISSUED_ANY      |   PMC0  | 4.000816e+07 |
|     UOPS_EXECUTED_CORE     |   PMC1  | 6.000806e+07 |
|      UOPS_RETIRED_ALL      |   PMC2  | 6.000724e+07 |
| BR_INST_RETIRED_NEAR_TAKEN |   PMC3  | 1.000056e+07 |
|      INSTR_RETIRED_ANY     |  FIXC0  | 6.000540e+07 |
|    CPU_CLK_UNHALTED_CORE   |  FIXC1  | 1.001363e+07 |
================================================================================
Group 1 Custom: Region one_micro_two_macro
================================================================================
|       UOPS_ISSUED_ANY      |   PMC0  | 5.000502e+07 |
|     UOPS_EXECUTED_CORE     |   PMC1  | 6.000506e+07 |
|      UOPS_RETIRED_ALL      |   PMC2  | 6.000471e+07 |
| BR_INST_RETIRED_NEAR_TAKEN |   PMC3  | 1.000040e+07 |
|      INSTR_RETIRED_ANY     |  FIXC0  | 7.000316e+07 |
|    CPU_CLK_UNHALTED_CORE   |  FIXC1  | 1.334216e+07 |
================================================================================
Group 1 Custom: Region one_micro_one_macro
================================================================================
|       UOPS_ISSUED_ANY      |   PMC0  | 6.000435e+07 |
|     UOPS_EXECUTED_CORE     |   PMC1  | 7.000444e+07 |
|      UOPS_RETIRED_ALL      |   PMC2  | 7.000445e+07 |
| BR_INST_RETIRED_NEAR_TAKEN |   PMC3  | 1.000039e+07 |
|      INSTR_RETIRED_ANY     |  FIXC0  | 7.000310e+07 |
|    CPU_CLK_UNHALTED_CORE   |  FIXC1  | 1.672351e+07 |
================================================================================
Group 1 Custom: Region no_micro_two_macro
================================================================================
|       UOPS_ISSUED_ANY      |   PMC0  | 6.000429e+07 |
|     UOPS_EXECUTED_CORE     |   PMC1  | 6.000438e+07 |
|      UOPS_RETIRED_ALL      |   PMC2  | 6.000438e+07 |
| BR_INST_RETIRED_NEAR_TAKEN |   PMC3  | 1.000039e+07 |
|      INSTR_RETIRED_ANY     |  FIXC0  | 8.000307e+07 |
|    CPU_CLK_UNHALTED_CORE   |  FIXC1  | 1.500636e+07 |
================================================================================
Group 1 Custom: Region no_micro_no_macro
================================================================================
|       UOPS_ISSUED_ANY      |   PMC0  | 8.000476e+07 |
|     UOPS_EXECUTED_CORE     |   PMC1  | 8.000483e+07 |
|      UOPS_RETIRED_ALL      |   PMC2  | 8.000466e+07 |
| BR_INST_RETIRED_NEAR_TAKEN |   PMC3  | 1.000039e+07 |
|      INSTR_RETIRED_ANY     |  FIXC0  | 8.000312e+07 |
|    CPU_CLK_UNHALTED_CORE   |  FIXC1  | 2.000775e+07 |

And on Haswell:

nate@haswell:~/src$ likwid-perfctr -m -g UOPS_ISSUED_ANY:PMC0,UOPS_EXECUTED_CORE:PMC1,UOPS_RETIRED_ALL:PMC2,BR_INST_RETIRED_NEAR_TAKEN:PMC3 -C 1 fusion
-------------------------------------------------------------
-------------------------------------------------------------
CPU type:	Intel Core Haswell processor
CPU clock:	3.39 GHz
-------------------------------------------------------------
fusion
two_micro_two_macro: sum1=10000000, sum2=9999999
one_micro_two_macro: sum1=10000000, sum2=9999999
one_micro_one_macro: sum1=10000000, sum2=9999999
no_micro_two_macro: sum1=10000000, sum2=9999999
no_micro_no_macro: sum1=10000000, sum2=9999999
=====================
Region: two_micro_two_macro
=====================
|      UOPS_ISSUED_ANY       | 4.00061e+07 |
|     UOPS_EXECUTED_CORE     | 6.00062e+07 |
|      UOPS_RETIRED_ALL      | 6.00046e+07 |
| BR_INST_RETIRED_NEAR_TAKEN | 1.00002e+07 |
|     INSTR_RETIRED_ANY      | 6.00013e+07 |
|   CPU_CLK_UNHALTED_CORE    | 1.7392e+07  |
=====================
Region: one_micro_two_macro
=====================
+----------------------------+-------------+
|           Event            |   core 1    |
+----------------------------+-------------+
|      UOPS_ISSUED_ANY       | 5.00062e+07 |
|     UOPS_EXECUTED_CORE     | 6.00062e+07 |
|      UOPS_RETIRED_ALL      | 6.00046e+07 |
| BR_INST_RETIRED_NEAR_TAKEN | 1.00002e+07 |
|     INSTR_RETIRED_ANY      | 7.00013e+07 |
|   CPU_CLK_UNHALTED_CORE    | 1.4247e+07  |
=====================
Region: one_micro_one_macro
=====================
+----------------------------+-------------+
|           Event            |   core 1    |
+----------------------------+-------------+
|      UOPS_ISSUED_ANY       | 6.00065e+07 |
|     UOPS_EXECUTED_CORE     | 7.00065e+07 |
|      UOPS_RETIRED_ALL      | 7.00048e+07 |
| BR_INST_RETIRED_NEAR_TAKEN | 1.00002e+07 |
|     INSTR_RETIRED_ANY      | 7.00013e+07 |
|   CPU_CLK_UNHALTED_CORE    | 1.69403e+07 |
=====================
Region: no_micro_two_macro
=====================
+----------------------------+-------------+
|           Event            |   core 1    |
+----------------------------+-------------+
|      UOPS_ISSUED_ANY       | 6.00062e+07 |
|     UOPS_EXECUTED_CORE     | 6.00062e+07 |
|      UOPS_RETIRED_ALL      | 6.00046e+07 |
| BR_INST_RETIRED_NEAR_TAKEN | 1.00002e+07 |
|     INSTR_RETIRED_ANY      | 8.00013e+07 |
|   CPU_CLK_UNHALTED_CORE    | 1.57365e+07 |
=====================
Region: no_micro_no_macro
=====================
|      UOPS_ISSUED_ANY       | 8.00062e+07 |
|     UOPS_EXECUTED_CORE     | 8.00062e+07 |
|      UOPS_RETIRED_ALL      | 8.00046e+07 |
| BR_INST_RETIRED_NEAR_TAKEN | 1.00002e+07 |
|     INSTR_RETIRED_ANY      | 8.00013e+07 |
|   CPU_CLK_UNHALTED_CORE    | 2.0043e+07  |
+----------------------------+-------------+

The main thing to notice is that on Skylake the "two macro two micro" version is fastest and executes at 1 cycle per iteration, while on Haswell it is slower than a couple of options with less fusion. BR_INST_RETIRED_NEAR_TAKEN shows the number of loop iterations. Run time in cycles is shown by CPU_CLK_UNHALTED_CORE. The difference between INSTR_RETIRED_ANY and UOPS_RETIRED_ALL shows the effect of macro-fusion of CMP/JCC. The difference between UOPS_ISSUED_ANY and UOPS_EXECUTED_CORE shows the effect of micro-fusion of LOAD/ADD. UOPS_EXECUTED_CORE and UOPS_RETIRED_ALL are the same on both machines, showing that there is no branch misprediction occurring.

   
Instruction Throughput on Skylake
Author: Tacit Murky Date: 2016-07-17 14:14
Interesting results. Looks like it's about the number of renamed registers. Apparently, Hwl had a lower TP restriction in the renamer, and it was upgraded for Skl. This explains the faster case for Hwl (more µops with fewer arguments each, but only up to a certain point). Peak issue rate is still 4 fused µIPC from the IDQ to rename, but 6 unfused µIPC (corresponding to up to 6 IPC) at retire for Skl. Hwl can't allow more than 5 unfused µIPC.
   
Haswell register renaming / unfused limits
Author:  Date: 2017-05-11 09:32
Tacit Murky wrote:
Looks like it's about the number of renamed registers.
Agreed. Simply changing Nathan's loops to use an immediate instead of a register for `max` produces a dramatic speedup on HSW:
  • Nathan's 2 micro / 2 macro on my HSW: one iteration per 1.42275c (~4.21 unfused-domain uops per clock). Very consistent, +- 0.0001 cycles per iter (for 1G iterations).
  • cmp r,imm instead of cmp r,max for both compares : one iteration per ~1.12c (~5.35 unfused-domain uops per clock). Pretty noisy, from 1.116c to 1.124c per iter.

My Skylake results match Nathan's: this bottleneck is gone, so the loop always runs at 1.0 cycles per iteration. (6 unfused-domain uops / clock). HSW and SKL measured with `perf stat` on Linux 4.8 and 4.10, counting only user-space counts for a statically linked binary. With 10^9 iterations on an otherwise-idle system, this is an easy way to get accurate low-noise numbers. Skylake isn't perfect: some runs are as bad as 1.02c / iter (for these and other loops). I think this is due to settling into a sub-optimal pattern rather than measurement noise, at least in some cases.

IDK why my HSW result is so much faster than Nathan's (1.42c instead of 1.73c). I measured on an i5-4210U and i7-6700k, with HT enabled but inactive (no other processes running). I still get stable and matching results even with max=10^7. The top of my loop is 32B-aligned, and both memory addresses are 64B-aligned.

I haven't tried to construct a loop that reads even more registers per clock, e.g. 3-operand FMA with a micro-fused memory operand, unrolled with different registers to avoid a latency bottleneck. Or ADC (flag input as well as flag output).

Hwl can't allow more than 5 unfused µIPC.

That's not right. With a somewhat artificial example, I can get HSW to sustain 6 unfused-domain uops per ~1.00 clocks (see below).

It seems more like a register-read limit, since reducing the number of input registers makes it run faster (e.g. changing a macro-fused cmp/jcc to an inc helps). Perhaps it is also partly reduced resource conflicts (the not-taken branch stealing cycles on p6), but maybe not, because Skylake doesn't have that problem.

    .loop:   ;; runs at 1.053c per iter on HSW
       add   rax, [rdi]
       inc    ebx
       blsi   rdx, [rsp]
       dec   ecx          ; ecx = max to start.
       jnz .loop

Predicted-not-taken CMP r,r/JCC has 2 inputs, 1 output (just flags). INC r has 1 input, 2 outputs (r and partial-flags).

With an ADD r,m instead of BLSI r,m, the loop runs at 1.08c per iteration on HSW. (Still about 1.00c on SKL). BLSI's destination register is write-only, unlike ADD's. This is also one fewer loop-carried dep chain, which may be significant. Replacing both ADDs with BLSI slows it down (to 1.076c per iter on HSW, 1.05c per iter on SKL), presumably because of imperfect scheduling leading to resource conflicts, since BLSI can only run on p15.

I got a slowdown on HSW and SKL from using imul r,m,imm to replace the second ADD, which is weird because its destination is write-only and out-of-order execution should easily hide its 3c latency. Presumably resource-conflicts for p1 are a problem. SKL: 1.29c to 1.55c (highly variable). HSW: more stable around 1.455c +- 0.05. IMUL writes flags, but BLSI doesn't. (Using add ebx,1 instead of inc didn't help, but using test ebx,ebx instead of inc did speed it up to about 1.18c on both HSW and SKL. I guess having 1 duplicated input and 1 output instead of 2 does help!)

With a somewhat artificial example, I can hit 1.005c on HSW (still not as fast as SKL's 1.0005c best-case for this: 10 times as far away from 1c per iter). Perhaps HSW is hitting PRF limitations. Using a micro-fused AVX instruction splits things between the integer and vector PRFs.

.loop:
    vpaddd xmm0, xmm0, [rdi]
    test ebx, ebx
    test rdx, [rsp]
    dec     ecx
    jnz .loop

Strangely, VPABSD xmm, m (write-only destination) was slower (1.04c) than VPADDD xmm0,xmm0,m (read-modify-write dest). This might be from resource conflicts, since it's also slower on SKL (1.004c to 1.015c). It's odd because HSW runs it on the same two ports as VPADDD. (SKL runs it on p01, but VPADDD on p015).

Avoiding the loop-carried dependency with VPADDD xmm0, xmm1, [rdi] was slightly slower on HSW (1.043c) than VPADDD xmm0,xmm0,[rdi], which smells like a register-read bottleneck on reading "cold" registers from the PRF.

Non-loop-carried dependency chains between two instructions in the loop seem to prevent it from running at 1c per iteration, even on SKL. (e.g. test ecx,ecx is a problem when ecx was written by the macro-fused loop-branch dec ecx/jnz, slowing HSW down to 1.068c). ----

Using indexed addressing-modes makes it run slower even on SKL. (But micro-fusion still happens on both HSW and SKL. Apparently un-lamination before the IDQ for indexed addressing modes only applies to SnB/IvB, not HSW! We already knew it didn't apply to SKL, but I had been assuming that change was new with SKL. I only got a HSW perf-counter test setup this week.)

  ;rsi=r8=0
  ;rsp and rdi are both 64B aligned.  rdi points into the BSS, in case that matters.
.loop:
    add  rdx, [rsp+rsi*4]

    cmp  r11, r12
    jne .end                   ; never taken, r11==r12

    add  ebx, [rdi+r8*4]

    sub ecx, r9d    ; alternatively,  sub ecx,1  to replace a reg with an immediate
    jnz .loop
.end:

Notice that although this is very similar to Nathan's two_micro_two_macro, there are no dependencies between any of the fused-domain uops. The loop-exit condition is just from decrementing ecx with a macro-fused uop.

This reads 7 "cold" registers (addressing modes, r11, r12, and r9), and 3 hot registers (rdx, ebx, and ecx) per iteration. It writes the 3 hot registers once each, and flags 4 times.

SKL runs it at 1.5566c / iter. Input registers per clock: 6.42 total, 4.50 cold, 1.93 hot. Total non-flag regs read+written per cycle: 8.35 = 13/1.5566. There's clearly a bottleneck, but IDK what it is. Touching fewer regs in the other 2 fused-domain uops makes it possible to micro-fuse indexed addressing modes and still run at 1c / iter on SKL.

HSW runs it at 1.6327c per iteration. Input registers per clock: 6.12 total, 4.29 cold, 1.83 hot. Total non-flag regs read+written per cycle: 7.96 = 13/1.6327.

uops_issued.stall_cycles shows that the front-end stalled instead of issuing a group of less than 4 (on HSW and SKL).

Reducing the number of cold inputs regs in different ways has different effects, so it's not as simple as just a bottleneck on that.

  • Changing the addressing mode on the second add to just [rdi], HSW runs it at 1.631c / iter. (very slightly faster than indexed)
  • Changing the CMP r11,r12 to TEST r11,r11 has no effect (same 1.6327c / iter)
  • Changing the CMP r11,r12 to CMP r11, 0 speeds it up to 1.594c / iter.
  • Changing the CMP r11,r12 to CMP r9d, 1 also speeds it up to 1.594c / iter (even though r9d is also read by sub, so it's not like P6-family register-read stalls where reading the same cold reg twice doesn't use extra resources)
  • Changing the CMP r11,r12/jne to CMP rdx,0/jl speeds it up to 1.35c / iter. (rdx was written by the previous ADD uop, so this macro-fused uop has no cold inputs anymore)
  • Using sub ecx,1 instead of sub ecx,r9d, HSW runs it at 1.3895c/iter +-0.0001. Input regs per clock: 6.48 total, 4.31 cold, 2.15 hot. Total non-flags read+written: 8.64/c = 12/1.3895c.

The results of these different changes are similar on SKL; things that speed up HSW significantly also speed up SKL.

I'm not sure if it matters whether input registers are cold or not (read from the PRF vs. forwarded from a not-yet-executed uop), or whether there's a different cause for what I'm seeing. Further testing is needed. Interesting things that could be tested:

  • micro-fused FMA with a base+index addressing mode should be a 4-input fused-domain uop. (or maybe this will be unlaminated)
  • On Skylake, ADCX / ADOX if they micro-fuse. (ADC doesn't, according to the instruction tables). Or even just ADC r,r might be interesting.
  • Does add r,r matter vs. andn r,r,r? I'm guessing not, since register renaming turns a RMW of an architectural register into a write to a new physical register anyway.
   
Haswell register renaming / unfused limits
Author: Tacit Murky Date: 2017-05-11 13:33
Very interesting, thanks. Maybe by replacing «inc ebx» in your 1st example with «mov [non-conflicting-address], r» you can get 7 unfused µops and 11 GPR reads per clock — if the hardware allows this.
   
Haswell register renaming / unfused limits
Author:  Date: 2017-05-12 20:22
Agner, your insn table says cmovcc r,m and adc r,m don't micro-fuse at all on HSW/SKL, but that doesn't match my experiments. They do micro-fuse on both HSW and SKL. (I didn't check SBB r,m).

I assume indexed addressing modes for cmov/adc are still fused in the decoder and un-laminated later, but I didn't check that. All I can see is that they're not micro-fused when they issue/retire.

I just made a major update to stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes, after testing things on HSW and SKL.

Peter Cordes wrote:

Interesting things that could be tested:
  • micro-fused FMA with a base+index addressing mode should be a 4-input fused-domain uop. (or maybe this will be unlaminated)
  • On Skylake, ADCX / ADOX if they micro-fuse. (ADC doesn't, according to the instruction tables). Or even just ADC r,r might be interesting.
Answer: FMA/ADC/CMOV on HSW and SKL are un-laminated with indexed addressing modes, so we can't have 4-input fused-domain uops.

This applies even to ADC/CMOV on Haswell, where they decode to 2 uops. So that's weird. I'm guessing they simply left those instructions alone from IvyBridge; maybe they ran into deadlines and didn't have time to change them until Broadwell. i.e. maybe they decided not to invest time in getting 3-input micro-fused uop support right when they knew they really wanted to make the register-source version a single uop (which would behave like FMA and un-laminate indexed addressing modes).

Unanswered questions: does un-lamination happen before the IDQ, or only at issue?

---------------

Re: Tacit Murky's suggestion to use a store to achieve 7 unfused-domain uops per clock: Good idea, this worked. Surprisingly, it even got it to run at 1.0 iterations per clock on SKL, with none of the stores stealing p23 from the loads.

.loop:   ; HSW: 1.12c / iter.  SKL: 1.0001c
    add   edx, [rsp]
    mov   [rax], edi
    blsi  ebx, [rdi]
    dec   ecx
    jnz .loop

SKL: 7 unfused uops per clock. HSW: 6.25. Register-reads per clock: 6 (not counting flags) total on SKL.

In my previous testing, I had assumed 32 vs. 64b operand-size didn't matter. But this loop runs at 1 iter per 1.12c with a 64b add, vs. 1.000c with a 32b add, on SKL. Totally bizarre. All three memory ops are in separate cache lines. I forget if that mattered.

The store has to be a simple addressing mode to run on port7, which is of course essential. IDK why HSW only runs this at 1.12c per iter, not nearly as close to 1.00 as SKL.

blsi r, [r+r] is 2 fused-domain uops, which is unexpected. (Changing it to an add is also a slowdown, I think because of reading the destination register).


With maximum register-reads:

.loop:   ; HSW: 1.75c  SKL: 1.42c
    add  edx, [rsp+rsi]
    mov  [rax], edi    ; an indexed store brings us up to HSW: 1.90c  SKL: 1.55c
    add  ebx, [rdi+r8]
    sub  ecx, r9d      ; r9d = 1
    jnz .loop

Register reads per clock: HSW: 10/1.75 = 5.71 /c total. SKL: 7.04/c total. Or with an indexed store: HSW: 5.79/c total GPRs read, SKL: 11/1.55 = 7.08/c.

-------------

To test for issue/rename bottlenecks vs. execution bottlenecks, I could make the loop longer and have a section of all-micro-fused instructions, and then a section of "easy" instructions. So the OOO core can easily keep up on average if the front-end issues 4 fused-domain uops per clock. But to do that, it would have to issue 8 unfused uops in a single cycle without stalling if there are at least 7 micro-fused uops in a row. I'll try that later, when I have time to get back to this.

   
Instruction Throughput on Skylake
Author: T Date: 2016-08-08 01:57
Thank you very much for that. It is really interesting and implies that compilers and assembly writers should tune differently for Haswell vs. Skylake. I wonder if icc has been updated to reflect it?
   
Unlamination of micro-fused ops in SKL and earlier
Author:  Date: 2016-09-09 19:36
There is an interesting effect which changed in Skylake (or at least some architecture after Sandy Bridge, up to and including Skylake), but isn't covered in your manual. It concerns the behavior of micro-fused instructions with *complex* memory source or destination operands. Here complex means with base and index registers, so something like

add rax, [rbx + rcx]

In Sandy Bridge, this doesn't seem to micro-fuse in the same way as simpler addressing modes such as:

add rax, [rbx + 16]

In particular, while it seems that the complex addressing modes fuse in the uop cache, the constituent ops are later "unlaminated" and consume rename and retirement resources. This means that you cannot achieve 4 micro-fused uops/cycle throughput with these addressing modes. The Intel optimization doc does touch on it briefly in 2.3.2.4 Micro-op Queue and the Loop Stream Detector (LSD):

In particular, loads combined with computational operations and all stores, when used with indexed addressing, are represented as a single micro-op in the decoder or Decoded ICache. In the micro-op queue they are fragmented into two micro-ops through a process called un-lamination, one does the load and the other does the operation. A typical example is the following "load plus operation" instruction:

ADD RAX, [RBP+RSI]; rax := rax + LD( RBP+RSI )

The Intel section is a bit unclear because they don't make it very explicit that this only applies to indexed addressing modes, and that if you don't use indexed addressing you can potentially achieve higher throughput.

This issue could be pretty critical for optimization of high IPC loops, on a par with many similar issues covered in your doc. In particular, it means jumping through a few hoops to be able to use a simpler addressing mode could be worth it - beyond the latency benefits already documented in your guide (and beyond the ability to use port 7 AGU for store address calculation as well).

It might be nice to add it to your doc! There is an extensive investigation on this stackoverflow question, which is what prompted me to post here. See in particular the answer from Peter Cordes, who shows the issue on Sandy Bridge. In another answer I have some tests that show the limitation is removed on Skylake, but we don't know exactly in which arch it was removed. The Intel doc is mostly silent on that topic (unlamination is only discussed in the one SnB-specific section I quoted above). If you have some other machines at your disposal, I have some code here that makes it easy to test the behavior (on Linux).

   
32B store-forwarding is slower than 16B
Author:  Date: 2017-05-11 10:37
Your microarch manual says that store-forwarding latency is 5c on Skylake for operand sizes other than 32/64b. I can confirm 5c for 128b vectors, but I've found that 256b store-forwarding is 6c on Skylake. I see your instruction tables already reflect this, so it's just a wording error in the microarch guide.

Also, in your instruction tables, you say that splitting up the store-forwarding latency between stores and loads is arbitrary. I disagree: It would be nice if loads listed the L1 load-use latency (from address being ready to data being ready). I don't think this is the case currently (e.g. you list Merom/Wolfdale/NHM/SnB's mov r,m as 2c latency, which is unreasonably low.)

If there are any CPUs where store-forwarding is faster than L1 load-use latency, that would mean negative latency for stores. But that's not the case on any x86 microarchitecture, I think.

----

While testing this on HSW and SKL, I found something weirder: an AVX128 load into an xmm register (zero-extending to 256) has an extra 1c of latency when read by a 256b instruction.


  • SKL: 12c for 3x dependent vmulps (xmm or ymm). HSW:15
  • 17c for 3x vmulps xmm and store/reload xmm. HSW:21. SF=5c/6c
  • 18c for 3x vmulps ymm and store/reload xmm. HSW:21 SF=6c/6c, or is it 5+1c?
  • 18c for 3x vmulps xmm and store/reload ymm. HSW:22 SF=6c/7c
  • 18c for 3x vmulps ymm and store/reload ymm. HSW:22 SF=6c/7c



vxorps xmm0,xmm0,xmm0
.loop:
vmulps ymm0, ymm0,ymm0
vmulps ymm0, ymm0,ymm0
vmulps ymm0, ymm0,ymm0
vmovaps [rdi], xmm0 ; This is the weird case for SKL: xmm store/reload with ymm FPU
vmovaps xmm0, [rdi]
dec ecx
jnz .loop

Also strange: with the mulps instructions commented out, I'm seeing SKL run the loop at only ~6.2c to 6.9c per iteration for *just* ymm store->reload with no ALU, rather than the expected 6.0c. So is there a limit to how often a 256b store-forward can happen? With xmm store/reload (and just a dec/jnz), the loop runs at one per 5.0c best case, sometimes as high as 5.02c per iter.

Same pattern for integer vectors: SKL doesn't benefit from narrowing the store/reload to xmm when the ALU loop is using ymm.

9c for 3x vpermd ymm SKL and HSW
15c for that + store/reload xmm (SKL and HSW). SF latency = 6c. (or 5+1c / 6c?)
15c for that + store/reload ymm SKL, 16c HSW. (movaps or movdqa). SF lat = 6c SKL, 7c HSW.

3c for 3x vpunpckldq ymm or xmm (SKL/HSW)
8.08 to 8.23c for vpunpck xmm + store/reload xmm. 9c HSW. SF=5.15c / 6c. (stabilizes to 5c / 6c with a longer ALU dependency chain between store/reload)
9c for vpunpck ymm + store/reload xmm (SKL). 9c HSW. SF=5+1c? / 6c
9c for vpunpck xmm + store/reload ymm. 10c HSW. SF=6c / 7c
9c for vpunpck ymm + store/reload ymm (SKL). 10c HSW. SF=6c / 7c

Using vmovaps vs. vmovdqa made no difference for either ivec or FPU instructions. rdi is pointing to a 64B-aligned buffer in the BSS.

So I'm seeing unstable results on SKL for doing a 128b store-forwarding with only 3c of ALU latency between the load and doing another store to the same address. Inserting more shuffles so fewer store-forwardings need to be kept in-flight stabilizes things so the store-forwarding latency is the expected 5.0c. HSW doesn't have that problem.

If the first shuffle is xmm and the others are ymm, then xmm store/reload only has 5c latency on SKL. So there's no extra latency for an ALU instruction to zero-extend, but there is for a load?

   
32B store-forwarding is slower than 16B
Author:  Date: 2017-06-28 18:33
I believe this is, ultimately, an artifact of how 256-bit vectors are implemented internally. Namely, I believe that the lower and upper 128-bit data paths are 1 cycle offset from each other (the upper datapath issues its half of the instruction one cycle later than the lower datapath does). [This helps switching between true 256-bit execution and 256bit-cracked-into-two-128bit-ops execution, because the load operand etc. timings are the same in both cases; this should simplify the load path and the bypass network.]

This is also my leading guess for the explanation of the 3-cycle latency of operations that cross the 128-bit halves: the lower and upper 128 bits are not only skewed in time, they are also separate bypass domains. So potentially cross-128b operations like vextracti128 have 1 extra cycle of latency purely from upper half of the input being available 1 cycle later than the lower half, and another extra cycle cross-domain bypass delay to shuttle the result from the upper bypass to the lower datapath.

Anyway, all of this is speculation, but if correct, then while 256-bit stores have full throughput (when running in 256-bit mode anyway), the second half of their data arrives in the designated store buffer slot one cycle later, and the store buffer is only marked as "data available" (with the values available for forwarding) once both halves have arrived. Thus the extra 1-cycle forwarding delay.

   
32B store-forwarding is slower than 16B
Author: Agner Date: 2017-06-28 23:57
Fabian Giesen wrote:
I believe this is, ultimately, an artifact of how 256-bit vectors are implemented internally. Namely, I believe that the lower and upper 128-bit data paths are 1 cycle offset from each other
I have found no evidence of the two 128-bit lanes being offset by one clock. Why would they do this if all execution units are 256 bits? It's a matter of physical distances on the chip. I think that all units belonging to the same 128-bit lane are clustered together to minimize the length of data paths within the same lane. Any instruction that transports data between different 128-bit lanes has an extra clock cycle delay for moving data from one lane-cluster to another. I guess the 256-bit store somehow uses the permute machinery or some lane-crossing paths even though the write port is 256 bits.
   
SHL/SHR r,cl latency is lower than throughput
Author:  Date: 2017-05-27 17:00

Your table lists variable-count SHL and SHR as 2c throughput, 2c latency. It appears that the 2c latency is only for flags. My results match yours for consecutive SHL instructions, but SHL is faster if surrounded by instructions that write all flags without reading them. (This is one case where ADD 1 is preferable to INC). In that case, it can achieve 1.5c throughput.

For SHL r,cl the latency from r to r, and from cl to r, is much less than 2c. (I measure more than 1c, but maybe only because of resource conflicts). I think only one of the three p06 uops is the actual shift that writes the dest reg (probably the same internally as SHLX/SHRX), while the other two are purely for flag-handling. We know it's 2c from input-flags -> output-flags, but I didn't measure the latency from r or cl to flags.

I think the instruction table should say: lat=1 tput=1.5 with a note saying "EFLAGS dependency limits throughput to 2c for consecutive shifts, and resource conflicts raise the average latency for the register operands". That's a lot to stick in a note, but 2c/2c does not reflect the performance in real use-cases very well at all. It's still a lot worse than SHLX, but not as bad as that.


    mov eax, 1000000000    ; iteration count
    mov ecx, 3
align 32
.loop:
    add edx, 1
    add edx, 1
    shl edx, cl
    add edx, 1
    add edx, 1

    sub rax, 1
    jnz .loop

perf counters from an otherwise-idle i7-6700k, using ocperf.py
5,228,964,721 cycles:u # 3.841 GHz
7,000,000,418 instructions:u # 1.34 insn per cycle
1,000,000,412 branches:u # 734.565 M/sec
8,000,128,015 uops_issued_any:u # 5876.614 M/sec
8,000,101,258 uops_executed_thread:u # 5876.594 M/sec

Without the SHL, the loop of course runs at the expected 4c per iter. The SHL slows it down by 1.229 cycles, not 2. Haswell goes from 4c to 5.296c, so the slowdown is higher (~1.30 instead of ~1.23).

With 13 dependent ADD instructions and one SHL in the loop, Skylake goes from 13c to 14.35c, but Haswell goes from 13c to 14.19c. So it's very weird and inconsistent, with Haswell seeing lower SHL latency the more infrequent they are, but SKL doing better when they're more frequent.

Results are fairly similar for SHL ecx, cl (so the shift-count input doesn't need to be ready early).

I was also able to hit 1.5c throughput for independent shifts with the same count by breaking SHL's flag input-dependency:


.loop:
    shl r8d, cl
    add ebx, 1     ; xor edx,edx also works here
    shl r9d, cl
    add esi, 1
    shl r10d, cl

    sub eax, 1     ; not DEC
    jnz .loop

5,000,450,873 cycles:u # 3.898 GHz
7,000,000,393 instructions:u # 1.40 insn per cycle
1,000,000,387 branches:u # 779.520 M/sec
12,000,132,094 uops_issued_any:u # 9354.338 M/sec
12,000,102,844 uops_executed_thread:u # 9354.315 M/sec

Results are the same on HSW and SKL to within measurement error. 5c per iteration with 3 SHL instructions in the loop is 1.666c throughput, bottlenecked on p06 throughput (including the loop-branch which has to run on p6). 3*3 + 1 = 10 p06 uops, which takes at least 5 cycles to execute.

Be careful of uop-cache issues when testing: making the loop longer can create a situation where it bottlenecks on the front-end because it's too dense to fit in the uop cache. e.g. adding another xor/shl pair makes a loop of 16 fused-domain uops that works as expected on SKL: 6.5 cycles per iter to execute 13 p06 uops, even though they're coming from the legacy decoders. But HSW only manages 8c throughput, apparently bottlenecked on the front-end. Using long instructions like ADD rsi, 12345 (7 bytes), and putting redundant REP prefixes on the add and shift instructions, restores performance on HSW as soon as the loop fits in the uop cache and can issue from the LSD.

   
Test results for Broadwell and Skylake
Author:  Date: 2017-05-30 12:31
More than a year ago, I wrote you that Skylake will have single-issue AVX-512. Here are a few more details on why I was led to this conclusion (partly copied from my post on AnandTech):

I can give you details about AVX-512 - they are pretty obvious from an analysis of Skylake's execution ports. So:

1) AVX-512 is mainly single-issue. All the AVX commands that are now supported on BOTH port 0 & port 1 will become AVX-512 commands supported on the joined port 0+1.

2) A few commands that are supported only on port 5 (these are various bit shuffles) will also be single-issued in AVX-512, which still means doubled performance - from single-issued AVX-256 to single-issued AVX-512.

3) A few commands that can be issued on any of 3 ports (0, 1, 5), including booleans and add/sub/cmp - the so-called PADD group - will be double-issued in AVX-512, so they will get a 33% uplift.

Overall, ports 0 & 1 will join when executing 512-bit commands, while port 5 is extended to 512-bit operands. The joined port 0+1 can execute almost any AVX-512 command except for the bit shuffle ones; port 5 can execute bit shuffles and the PADD group.

---------

When going from SSE to AVX, Intel sacrificed ease of programming for ease of hardware implementation, resulting in an almost complete lack of commands that can exchange data between the upper & lower parts of a ymm register (so-called lanes). AVX-512 was done right, but this means that bit shuffle commands require a full 512-bit mesh. So Intel moved all these commands to port 5, making it the only full 512-bit port, while most remaining commands were moved onto ports 0 & 1, where a 512-bit command can be implemented as a simple pair of 256-bit ones.

Looking at power budgets, it's obvious that a simple doubling of execution resources (i.e. support of 512-bit commands instead of 256-bit ones) is impossible. In the previous CPU generation, even AVX commands increased energy usage by 40%, so it's easy to predict that extending each executed command to 512 bits would require another 80% increase.

Also, it's easy to compare Skylake with Broadwell and see many strange changes:

1) The Intel microarchitecture implements SIMD commands on ports 0/1/5 and usually tries to spread commands equally among these 3 ports to increase final performance. But Skylake is much more asymmetric in that regard - it implements all but the bit shuffle commands on ports 0 & 1.

2) Skylake tries to implement commands on BOTH ports 0 & 1 with maniacal diligence, including such rarely-used commands as PMUL and PCMPGTQ. As a result, PCMPGTQ throughput was quadrupled! And PMUL is now supported by 2 ports while scalar MUL is only on one. You will find many more examples, while only extremely expensive commands like division didn't get doubled throughput.

3) When Intel added AVX/AVX2 in SB/HW, it decreased the throughput of some commands - e.g. Nehalem had double-issue both for bit shuffle and bit-combine commands, while SB/HW reduced their throughput to 1. So if Skylake were going to add AVX-512 support, it might be expected to do the same (i.e. reduce the throughput of rarely used commands), again to reduce the power/transistor budget. But in practice, it doubled the throughput of many commands while keeping the single-issue throughput of shuffles. The idea that ports 0 & 1 will co-execute 512-bit commands while port 5 will extend all its commands to 512 bits explains excellently why this was done, while the idea that everything will just be extended to 512 bits fails miserably.

So once I read the Intel optimization manual and thought a while, it became obvious. Moreover, I believe that Skylake implemented all 4 ISA extensions that Intel has marketed (SGX/MPX/SHA/AVX3), but they were not enabled earlier due to marketing/market-slicing requirements. Intel just needs a counter-weapon against Ryzen, so it didn't show all of Skylake's strength in 2015 when its position was already strong.

---------

Of course, microarchitecture analysis can't say anything about commands absent from the AVX2 set, so my guess is that predicate register manipulations will also go to port 5, just to make the microarchitecture a bit less asymmetric.

Also, it's easy to predict that in the next generations the first "improvement" will be to add FMAD capability to port 5, further doubling the marketing performance figures.

   
Test results for Broadwell and Skylake
Author: Agner Date: 2017-05-30 12:49
Bulat Ziganshin wrote:
more than year ago, i wrote you that skylake will have single-issue avx-512.
I think this kind of speculation is unsound if you have no inside information.
   
Test results for Broadwell and Skylake
Author:  Date: 2017-05-30 16:24
I think we will see that in a few weeks :) Please keep that message, so we can compare it to the facts.

I have no insider info, just thorough knowledge of all these microarchitectures, from your and Intel's manuals. As you see, my analysis rests on the strange aspects of the SKL microarchitecture - the proposed implementation perfectly explains them all.

SKL doubled and sometimes even quadrupled the throughput of many commands in order to make ports 0 & 1 highly symmetric, and this doesn't make any sense other than preparing these ports to perform 512-bit commands in tandem.

SKL moved all but the shuffle commands to ports 0 & 1 - and I think that is because only shuffle commands cannot be split into two 256-bit subcommands, so only these commands require a port with full 512-bit capability, and they dedicated port 5 to that task.

Yes, my explanation is highly speculative, but I don't see any other explanation for all these changes, which made AVX-256 execution less efficient (because most commands are now executed only by ports 0 & 1), nor for why many rare commands got higher throughput. If Intel planned to just extend each 256-bit command to 512 bits, they would, on the contrary, reduce the throughput of rarely-used commands (as was done in SB/HW compared to Nehalem), and keep ports 0/1/5 equally populated.

Just one question - do you agree that the SKL changes compared to HW are strange, and either decrease performance (moving most commands to ports 0 & 1) or add more hardware for a tiny speedup (implementing almost everything on BOTH ports 0 & 1)?

Btw, one hint is that Intel claims their 18-core CPU will outperform 1 TFLOPS. If SKL-X can perform two 512-bit FMA commands per CPU cycle, they may easily claim breaking the 2 TFLOPS barrier.

   
Test results for Broadwell and Skylake
Author:  Date: 2017-06-19 20:22
> so we can compare it to the facts

So we have some info out now: www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/3

From the article:

> Nominally the FMAs on ports 0 and 1 are 256-bit, so in order to drive towards the AVX-512-F these two ports are fused together, similar to how AVX-512-F is implemented in Knights Landing. The six-core and eight-core Skylake-X parts support one fused FMA for AVX-512-F, although the 10-core will support dual 512-bit AVX-512-F ports, which seems to be located on port 5. This means that the 10-core i9-7900X can support 64 SP or 32 DP calculations per cycle, whereas the 8-core/6-core parts can support 32 SP or 16 DP per cycle.

I don't recall Intel ever doing anything similar for product segmentation in the past, so limiting execution ports on cheaper SKUs seems to be a first to me.
Anyway, it sounds like you were on the ball, except that port 5 can also do FMA on higher SKUs, for more FLOPS.

   
Test results for Broadwell and Skylake
Author:  Date: 2017-06-20 12:18
- wrote:
> so we can compare it to the facts


Based also on Bulat Ziganshin's comments, I'd find a 3-uop dual AVX-512 issue (2 256-bit uops on ports 0-1 + 1 512-bit uop on port 5) highly unlikely. I'd assume the 256-bit units on ports 0-1 were kept mostly intact (except for some extra vector-mask logic required for AVX-512) on the six- and eight-core variants, whereas both ports were expanded to 512 bits for the 10-core variants (since, as mentioned before, most functionality was "cloned" between them in Skylake's first iteration).

Also, let's recall that Knights Landing doesn't "fuse ports" for AVX-512 execution, since both per-core VPUs are 512 bits wide :)

   
Test results for Broadwell and Skylake
Author:  Date: 2017-06-20 14:37
Adding on to my previous comment: the point is that, short of some ambiguous decoding scheme for AVX-512 (1 512-bit uop vs. 2 256-bit uops), which would depend entirely on efficiently monitoring which port (the combined port "0+1" or port 5) would be available sooner, it would look more like a Pentium M/Core multiple-port, single-execution-unit scheme (i.e. where some instructions can be dispatched through multiple ports but share some common execution units). The "combined port 0+1" concept would be very similar to the Pentium M/Core scheme, which, as stated in Agner's manuals, can lead to some performance issues when mixing dual-port (combined) against single-port instructions, mainly when the shared execution units are used.

Or there is even one last option (but I'd bet it's highly unlikely too): only one of the 256-bit units (maybe port 0) being promoted to 512 bits on the lower-end models, and both on the higher-end models.

   
Test results for Broadwell and Skylake
Author:  Date: 2017-06-21 12:49
The Russian review https://3dnews.ru/954174 has, as usual, more thorough low-level benchmarks than AnandTech. In particular, an important test: https://3dnews.ru/assets/external/illustrations/2017/06/19/954174/avx-512.png

As we can see there, FP computations got almost a 2x speedup, while INT got only 20-40% improvements.

I think the latter result lines up perfectly with my prediction - port 5 was extended to 512 bits, so bit shuffling becomes 2x faster, and the PADD group got a 33% boost. I expected a 10-20% overall speedup, but probably new AVX-512 features (new instructions, built-in masking) further improved the performance.

My earlier prediction was: "also it's easy to predict that in the next generations the first "improvement" will be to add FMAD capability to port 5, further doubling the marketing performance figures"

I didn't expect it in the Skylake generation due to the excessive TDP increase (as we know, even using AVX2 on previous generations increased TDP by 40%, so two full-featured AVX-512 ports should increase TDP by a *further* 80%!). Nevertheless, they did exactly that, and ran into exactly the TDP problems you'd expect.

Note that from the 3dnews test we can conclude that port 5 gained only an FMA engine, and no other AVX-512 operations (apart from the mere widening of the AVX2 operations already present on this port).

So I can say my speculation turned out to be 200% right :)


But stepping back over everything we know, it seems that from a technical viewpoint Skylake is a total mess! The SKL architecture I predicted was a compromise: it added as little hardware unused in AVX-256 mode as possible, but still had AVX-512 support. It was a great step toward future processors: add 512-bit support for forward compatibility, but don't invest heavily in AVX-512-only hardware until more 512-bit programs arrive. To reach this goal, they made some changes that were bad for AVX2 programs (see my second post).

But when they added the second FMA512 engine, this became meaningless. Now we have a design that both limits AVX2 performance and has a lot of hardware unused in AVX2 mode! By simply extending the Haswell engines 2x, they could have got a slightly higher transistor count and much better AVX-512 performance.

I think this is the result of marketing games: SKL-S already had AVX-512 support (without the second FMA engine, though), but they decided to disable it on all SKUs. The newer SKL-X added the second engine, but enabled it only on selected SKUs, so the i7 provides exactly the architecture I predicted (and probably it was their plan B: use SKL-S cores with a single FMA engine for HEDT/Xeon products).

Now we can also see why SKL-S reduced the L2$ associativity to 4. It was preparation for increasing the cache size: the SKL-S cache is just a quarter of the SKL-X cache with the same organization, and the reduced associativity allowed them to cut the transistor budget of the massive 1 MB cache. This is a sign that the SKL-X core is a much smaller modification of the SKL-S core than one might think at first sight.

   
Test results for Broadwell and Skylake
Author:  Date: 2017-06-26 08:35
Bulat Ziganshin wrote:
russian review https://3dnews.ru/954174 , as usual, has more thorough low-level benchmarks than anand. In particular, important test: https://3dnews.ru/assets/external/illustrations/2017/06/19/954174/avx-512.png

Notice from https://3dnews.ru/assets/external/illustrations/2017/06/19/954174/cpuz-1.png that the (probably Xeon-only) AVX512BW, DQ and VL extensions are missing. It's also still unknown how much of the integer performance increase was due to the improved gather/scatter and the bigger L2$.
Maybe one could expect some Sandy Bridge/Ivy Bridge-style doubled FP performance (with almost the same integer performance) on ports 0 and 1, plus a 512-bit shuffle on port 5. This might be a more reasonable decision, since it would require a minimal amount of non-AVX-512 hardware, as opposed to a full 512-bit FMA on port 5.
I'm still wondering how an asymmetrical (256-bit uOps vs. a 512-bit uOp) decoding scheme could be made to work for AVX-512 across ports 0+1 and 5.
   
Test results for Broadwell and Skylake
Author:  Date: 2017-07-05 21:07
I came across an instruction latency dump for the 7900X (10-core, dual-issue 512-bit FMA): users.atw.hu/instlatx64/GenuineIntel0050654_SkylakeX_InstLatX64.txt

From a cursory scan, AVX-512 actually looks mostly dual-issue on this CPU. In general, 512-bit instructions have the same throughput as their 256-bit counterparts, except for instructions implemented on 3x 256-bit ports, which get "reduced" to 2x 512-bit ports.
Presumably the lower SKUs are mostly single-issue?

It's also interesting to note that the K* mask instructions appear to be implemented on a single port only.

   
Test results for Broadwell and Skylake
Author:  Date: 2017-07-12 00:13
Slide from Intel: https://www.pcper.com/image/view/83900?return=node%2F68093
Also from that article, it's interesting to note that the reduced AVX clocks depend on the type of instruction as well; presumably this means that 256-bit integer AVX2 code won't be throttled, as opposed to 256-bit FP code.

Intel's optimization manual has also been updated, with more details: https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

a 1MB L2 cache and an additional Intel AVX-512 FMA unit on port 5 which is available on some parts.

Since port 0 and port 1 are 256-bits wide, Intel AVX-512 operations that will be dispatched to port 0 will execute on both port 0 and port 1; however, other operations such as lea can still execute on port 1 in parallel. See the red block in Figure 2-3 for the fusion of ports 0 and 1.

Notice that, unlike Skylake microarchitecture for client, the Skylake Server microarchitecture has its front end loop stream detector (LSD) disabled.

The guide also provides an example of how to detect chips with 1 vs. 2 FMA units (section 13.20), which seems to compare shuffle+FMA throughput against FMA-only throughput (apparently it cannot be detected via CPUID :O).

Also interesting to note is that mixing 256-bit and 512-bit instructions causes the CPU to run in "512b port mode" (section 13.19), where the 256-bit instructions only get the throughput of the equivalent 512-bit instruction.

   
Test results for Broadwell and Skylake
Author:  Date: 2017-07-19 06:52
- wrote:
https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

Also interesting to note is that mixing 256b and 512b instructions causes the CPU to run in '512b port mode' (section 13.19), where the 256b instructions only get the throughput of the equivalent 512b instruction.

Also in 13.19: "The maximum register width in the reservation station (RS) determines the 256 or 512 port scheme."
I guess this was the adopted solution for avoiding vector stalls on port 1 when port 0 is in use under the port 0+1 AVX-512 scheme, even though it puts higher stress on port 5, as noted in the manual.
   
Test results for Broadwell and Skylake
Author:  Date: 2017-06-28 09:54
Has anybody got their hands on a 7900X? Could you please run a test on the gather/scatter performance?
I'm interested to know whether the throughput of the AVX2/AVX-512 gather instructions has improved.
   
Test results for Broadwell and Skylake
Author:  Date: 2017-06-29 11:24
It would be pretty surprising if gather performance improved much, since at least since the original Skylake it has been pretty much at 0.5 cycles/element, which is the limit of the DCU. So only by adding extra load hardware (quite expensive), or by optimizing certain gathers with identical or nearby elements (e.g., from the same cache line), could I see the throughput going up much.

The latter approach probably doesn't cost too much, but only benefits certain workloads and not the general gather capability.

   
Test results for Broadwell and Skylake
Author:  Date: 2017-06-30 03:21
According to the following URL:
techreport.com/review/32111/intel-core-i9-7900x-cpu-reviewed-part-one

Skylake-X is able to handle two 64-byte loads per cycle, so there is a chance the throughput of gather can be improved.

   
Test results for Broadwell and Skylake
Author:  Date: 2017-07-13 13:07
Xing Liu wrote:
According to the following URL:
techreport.com/review/32111/intel-core-i9-7900x-cpu-reviewed-part-one

SKYLAKE-X is able to handle two 64-byte loads per cycle, so there is a chance the throughput of gather can be improved.

Not really, because it's still two loads; only the load width has been increased to 64 B. The gather implementations are limited by the *number of loads*, not by the width of the loads (indeed, even 32 bytes was already much wider than the largest gather element of 8 bytes).

Gather pretty much runs at 2 loads/cycle on the existing implementations, so unless you break that barrier (i.e., go to 3 load ports) you are very unlikely to see gather perform better than that in the general case. What you might see first are optimizations for special cases of adjacent or overlapping elements, but that's more or less orthogonal to load width.

   
Official information about uOps and latency SNB+
Author: SEt Date: 2017-07-17 20:41
It looks like Intel has released some information about the inner workings of Sandy Bridge and newer CPUs: https://reviews.llvm.org/rL307529

Is it indeed accurate? Should the instruction tables manual be updated with that information?