Agner`s CPU blog

Test results for Broadwell and Skylake

Author: Agner

Date: 2015-12-26 08:27

The optimization manuals at www.agner.org/optimize/#manuals have been updated. I have now tested the Intel Broadwell and Skylake processors. I have not tested the AMD Excavator and Puma because I cannot find suitable motherboards for testing them.

The test results show that the pipeline and execution units in Broadwell is very similar to its predecessor Haswell, while the Skylake has been reorganized a little.

The Skylake has a somewhat improved cache throughput and supports the new DDR4 RAM. This is important since RAM access is the bottleneck in many applications. On the other hand, the Skylake has reduced the level-2 cache associativity from 8 to 4.

Floating point division has been improved a little in Broadwell and integer division has been improved a little in Skylake. Gather instructions, which are used for collecting non-contiguous data from memory and joining them into a vector register, are improved somewhat in Broadwell, and a little more in Skylake. This makes it more efficient to collect data into vector registers.

Ever since the first Intel processor with out-of-order execution was released in 1995, there has been a limitation that no micro-operation could have more than two input dependencies. This meant that instructions with more than two input dependencies were split into two or more micro-operations. The introduction of fused multiply-and-add (FMA) instructions in Haswell made it necessary to overcome this limitation. Thus, the FMA instructions were the first instructions to be implemented with micro-operations with three input dependencies in an Intel processor. Once this limitation has been broken, the new capability can also be applied to other instructions. The Broadwell has extended the capability for three-input micro-operations to add-with-carry, subtract-with-borrow and conditional move instructions. The Skylake has extended it further to a blend instruction. AMD processors have never had this limitation of two input dependencies. Perhaps this is the reason why AMD came before Intel with FMA instructions.

The Haswell and Broadwell have two execution units for floating point multiplication and FMA, but only one for addition. This is odd since most floating point code has more additions than multiplications. To get the maximum floating point throughput on these processors, one might have to replace some additions with FMA instructions with a multiplier of 1. Fortunately, the Skylake has fixed this imbalance and made two floating point arithmetic units, both of which can handle both addition, multiplication and FMA. This gives a maximum throughput of two floating point vector operations per clock cycle.

The Skylake has increased the number of execution units for integer vector arithmetic from two to three. In general, the Skylake now has multiple execution units for almost all common operations (except memory write and data permutations). This means that an instruction or micro-operation rarely has to wait for a vacant execution unit. A throughput of four instructions per clock cycle is now a realistic goal for CPU-intensive code, unless the software contains long dependency chains. All arithmetic and logic units support vectors of up to 256 bits. The anticipated support for 512-bit vectors with the AVX-512 instruction set has been postponed to 2016 or 2017.

Intel's design has traditionally tried to standardize operation latencies, i. e. the number of clock cycles that a micro-operation takes. Operations with the same latencies were organized under the same execution port in order to avoid a clash when operations that start at different times would finish at the same time and so need the result bus at the same time. The Skylake microarchitecture has been improved to allow operations with several different latencies under the same execution port. There is still some standardization of latencies left, though. All floating point additions, multiplications and FMA operations have a latency of 4 clock cycles on Skylake, while these were 3 and 5 on previous processors.

Store forwarding is one clock cycle faster on Skylake than on previous processors. Store forwarding is the time it takes to read from a memory address immediately after writing to the same address.

Previous Intel processors have different states for code that use the AVX instruction sets allowing 256-bit vectors versus legacy code with 128-bit vectors and no VEX prefixes. The Sandy Bridge, Ivy Bridge, Haswell and Broadwell processors all have these states and a serious penalty of 70 clock cycles for state switching when a piece of code accidentally mixed VEX and non-VEX instructions. This annoying state shift and penalty has been eliminated on the Skylake. Apparently, the implementation of 256-bit registers has become more streamlined.

I observed an interesting phenomenon when executing 256-bit vector instructions on the Skylake. There is a warm-up period of approximately 14 Âµs before it can execute 256-bit vector instructions at full speed. Apparently, the upper 128-bit half of the execution units and data buses is turned off in order to save power when it is not used. As soon as the processor sees a 256-bit instruction it starts to power up the upper half. It can still execute 256-bit instructions during the warm-up period, but it does so by using the lower 128-bit units twice for every 256-bit vector. The result is that the throughput for 256-bit vectors is 4-5 times slower during this warm-up period. If you know in advance that you will need to use 256-bit instructions soon, then you can start the warm-up process by placing a dummy 256-bit instruction at a strategic place in the code. My measurements showed that the upper half of the units is shut down again after 675 Âµs of inactivity.

This warm-up phenomenon has reportedly been observed in previous processors as well (see agner.org/optimize/blog/read.php?i=378#378), but I have not observed it before in any of the processors that I have tested. Perhaps some high-end versions of Intel processors have this ability to shut down the upper 128-bit lane in order to save power, while other variants of the same processors have no such feature. This is something that needs further investigation.

Reply To This Message

Next Message