Agner's CPU blog


 
Thread overview:
Optimization manuals updated - Agner - 2013-09-04
Optimization manuals updated - Agner - 2014-02-19
Latency of PTEST/VPTEST - Nathan Kurz - 2014-05-20
Latency of PTEST/VPTEST - Agner - 2014-05-20
Optimization manuals updated - Silvermont test - Agner - 2014-08-08
Optimization manuals updated - Silvermont test - Tacit Murky - 2014-08-11
Optimization manuals updated - Silvermont test - Agner - 2014-08-13
Conditional operation - Just_Coder - 2014-09-20
Conditional operation - Agner - 2014-09-21
Conditional operation - Slacker - 2014-10-06
Optimization manuals updated - Slacker - 2014-10-06
Optimization manuals updated - jenya - 2014-10-10
FP pipelines on Intel's Haswell core - John D. McCalpin - 2014-10-17
FP pipelines on Intel's Haswell core - Agner - 2014-10-18
 
Optimization manuals updated
Author: Agner Date: 2013-09-04 11:10

The optimization manuals at www.agner.org/optimize/#manuals have now been updated. The most important additions are:

  • AMD Piledriver and Jaguar processors are now described in the microarchitecture manual and the instruction tables.
  • Intel Ivy Bridge and Haswell processors are now described in the microarchitecture manual and the instruction tables.
  • The micro-op cache of Intel processors is analyzed in more detail.
  • The assembly manual has more information on the AVX2 instruction set.
  • The C++ manual describes the use of my vector classes for writing parallel code; a small example is sketched below.
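For illustration, here is a minimal sketch of the kind of parallel code these vector classes allow (my own example; it assumes the Vec8f type from the vector class library and that n is divisible by 8):

    #include "vectorclass.h"   // Agner's vector class library

    // Add two float arrays eight elements at a time using 256-bit AVX registers.
    void add_arrays(const float * a, const float * b, float * c, int n) {
        for (int i = 0; i < n; i += 8) {
            Vec8f va, vb;
            va.load(a + i);           // read 8 floats into a vector register
            vb.load(b + i);
            (va + vb).store(c + i);   // one vector addition covers 8 elements
        }
    }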

Some interesting test results for the newly tested processors:

AMD Piledriver

  • Similar microarchitecture to Bulldozer
  • Supports fused multiply-and-add instructions in both the FMA3 and FMA4 forms. FMA3 is compatible with Intel processors. See Wikipedia for a discussion of the incompatibility between these instruction sets.
  • The throughput of FMA3 instructions is only half that of FMA4 instructions, even though they do exactly the same calculation.
  • Memory writes with the 256-bit AVX registers are exceptionally slow. The measured throughput is 5 - 6 times slower than on the previous model (Bulldozer), and 8 - 9 times slower than two 128-bit writes. No explanation for this has been found. This design flaw is likely to negate any advantage of using the AVX instruction set (a workaround is sketched after this list).
  • The problems with cache performance on the Bulldozer seem to have been fixed in the Piledriver.
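A natural workaround for the slow 256-bit writes is to split each 256-bit store into two 128-bit halves. Here is a minimal sketch with AVX intrinsics (the function name is mine, for illustration only):

    #include <immintrin.h>

    // Store a 256-bit vector as two 128-bit writes, avoiding the slow
    // 256-bit store path observed on Piledriver.
    static inline void store_256_as_2x128(float * p, __m256 v) {
        _mm_storeu_ps(p,     _mm256_castps256_ps128(v));    // low 128 bits
        _mm_storeu_ps(p + 4, _mm256_extractf128_ps(v, 1));  // high 128 bits
    }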

AMD Jaguar

  • Similar microarchitecture to Bobcat
  • Supports the AVX instruction set
  • Does not support AMD's 3DNow and XOP instruction sets. This is OK with me since few programmers would care to make a special version of their code specifically for AMD processors.
  • The vector execution units are doubled in size from 64 bits in Bobcat to 128 bits in Jaguar. The throughput of vector instructions is doubled. Floating point scalar (non-vector) performance was quite good already on the Bobcat and is unchanged on the Jaguar.
  • Load and store units are also doubled from 64 bits to 128 bits.
  • Store-to-load forwarding is much faster than on Bobcat
  • The prefetch instruction is particularly slow on Jaguar. The throughput is much lower than on other AMD processors.
  • Integer division is improved
  • Register moves with vector registers are eliminated if the register is known by the processor to be zero. Register moves are not eliminated if the value of the register is unknown. This seems to indicate that registers are not allocated if they are known to be zero.
  • The VMASKMOVPS instruction with a memory source operand takes more than 300 clock cycles on the Jaguar when the mask is zero, in which case the instruction should do nothing. This appears to be a design flaw. The instruction is not very common, though. A possible workaround is sketched below.
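Where the zero-mask case can occur, a cheap guard avoids the worst case. A sketch with AVX intrinsics (my own construction, not from the manuals):

    #include <immintrin.h>

    // Skip VMASKMOVPS entirely when the mask is all zero, avoiding the
    // >300 clock cycle penalty observed on Jaguar in that case.
    static inline __m256 guarded_maskload(const float * p, __m256i mask) {
        if (_mm256_testz_si256(mask, mask))   // VPTEST: is the mask all zero?
            return _mm256_setzero_ps();       // nothing to load
        return _mm256_maskload_ps(p, mask);
    }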

Intel Ivy Bridge

  • Similar microarchitecture to Sandy Bridge
  • Can eliminate register-to-register moves by renaming the target register
  • Problem with decoding long NOPs in Sandy Bridge has been fixed
  • Some execution units have been moved to a different port
  • Handling of partial registers is improved
  • The prefetch instructions are particularly slow on Ivy Bridge. The throughput is much lower than on other Intel processors.
  • Store-to-load forwarding is generally good, but in some unfortunate cases of an unaligned 256-bit read after a smaller write there is an unusually large delay of more than 200 clock cycles (see the sketch below).
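To make the pattern concrete, here is a sketch (my own, with illustrative names) of the slow case and a way to avoid it:

    #include <immintrin.h>

    float buf[8];

    __m256 slow_pattern(__m128 lo, __m128 hi) {
        // Two 128-bit writes followed by a 256-bit read of the same bytes:
        // store forwarding can fail here, stalling for >200 clock cycles.
        _mm_storeu_ps(buf,     lo);
        _mm_storeu_ps(buf + 4, hi);
        return _mm256_loadu_ps(buf);
    }

    __m256 safe_pattern(__m128 lo, __m128 hi) {
        // Combine the halves in registers instead, so no wide read
        // follows a narrower write to the same address.
        return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
    }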

Intel Haswell

  • Supports the new AVX2 instruction set which allows integer vectors of 256 bits and gather instructions
  • Supports fused multiply-and-add instructions of the FMA3 type
  • The cache bandwidth is doubled to 256 bits. It can do two reads and one write per clock cycle.
  • Cache bank conflicts have been removed
  • The read and write buffers, register files, reorder buffer and reservation station are all bigger than in previous processors
  • There are more execution units and one more execution port than on previous processors. This makes a throughput of four instructions per clock cycle quite realistic in many cases.
  • The throughput for not-taken branches is doubled to two not-taken branches per clock cycle, including fused branch instructions. The throughput for taken branches is largely unchanged.
  • There are two execution units for floating point multiplication and for fused multiply-and-add, but only one execution unit for floating point addition. This design appears to be suboptimal, since floating point code typically contains more additions than multiplications. But at least it enables Intel to boast a throughput of 32 floating point operations per clock cycle.
  • The fused multiply-and-add operation is the first case in the history of Intel processors of micro-ops having more than two input dependencies. Other instructions with more than two input dependencies are still split into two micro-ops, though. AMD processors don't have this limitation.
  • The delays for moving data between different execution units are smaller than on previous Intel processors in many cases.
   
Optimization manuals updated
Author: Agner Date: 2014-02-19 05:15

The optimization manuals at www.agner.org/optimize/#manuals have now been updated with test of the AMD Steamroller microprocessor.

There are also minor additions regarding the forthcoming AVX-512 instruction set.

I have not tested the Intel Silvermont/Bay Trail processor yet because the test machine I have access to cannot run Linux, and the kind of tests that I want to do are very difficult to do under Windows.

Test results for AMD Steamroller

  • Similar microarchitecture to Bulldozer and Piledriver
  • Has one instruction decoder per thread, where previous designs shared a decoder between two threads. This removes a potential bottleneck.
  • Instruction fetch is still shared between two threads. This is a likely bottleneck.
  • Instruction cache increased by 50%
  • New loop buffer can store at least 32 decoded instructions. The exact size is not known.
  • Improved throughput for level-2 cache write
  • Store forwarding improved. Supports small read after bigger write
  • Floating point / vector unit redesigned with three pipes where previous designs had four pipes
  • Maximum throughput is four instructions per clock when integer and vector instructions are mixed
  • Floating point division improved
  • No penalty for floating point denormal and underflow results
  • Some performance flaws in Piledriver have been fixed. Most importantly, 256-bit stores are now performing well.
  • A new performance flaw has been added, though: floating point vector addition has lower throughput than expected.
  • Supports AVX, but not AVX2
   
Latency of PTEST/VPTEST
Author: Nathan Kurz Date: 2014-05-20 03:05
Hi Agner ---

I noticed that the Intel documentation at https://software.intel.com/sites/landingpage/IntrinsicsGuide/ shows "VPTEST ymm, ymm" as having a latency of 4 cycles on Haswell, up from 2 on Sandy and Ivy Bridge. They also list "PTEST xmm, xmm" as having a latency of 2 on all platforms.

Your current guide shows a latency of 1 for "PTEST x,x" on Sandy Bridge, and 2 for "PTEST v,v" on Haswell. Are you confident in these measurements, or is it possible that the Intel guide is correct here? Or is this just a terminology difference between PTEST and VPTEST?

Thanks!

   
Latency of PTEST/VPTEST
Author: Agner Date: 2014-05-20 06:23
It is impossible to measure the latency of an instruction with one type of registers as input (here YMM) and another type of registers as output (here flags). It is only possible to measure the round trip latency of a series of instructions ending with the same type of registers as it started with. The fact that the upper and lower half of a 256-bit register may have different latencies makes this even more difficult. I will have to improve the measurement of VPTEST in the next round of tests, but it is probably right that PTEST has higher latency for YMM registers than for XMM registers.
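For illustration, a round-trip measurement might look like this sketch (my own construction; the loop alternates VPTEST with instructions that carry the result back into a vector register, so only the latency of the whole chain can be observed):

    #include <immintrin.h>
    #include <x86intrin.h>   // __rdtsc (GCC/Clang)
    #include <cstdio>

    int main() {
        __m256i v = _mm256_set1_epi32(1);
        int r = 1;
        unsigned long long t0 = __rdtsc();
        for (int i = 0; i < 1000000; i++) {
            v = _mm256_set1_epi32(r);           // integer -> vector register
            r = _mm256_testz_si256(v, v) + 1;   // VPTEST: vector -> flags -> integer
        }
        unsigned long long t1 = __rdtsc();
        // Clocks per iteration measure the whole round trip,
        // not the latency of VPTEST alone.
        printf("%.1f clocks per round trip\n", (t1 - t0) / 1e6);
        return 0;
    }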
   
Optimization manuals updated - Silvermont test
Author: Agner Date: 2014-08-08 06:02
My manuals have now finally been updated with a test of Intel's Silvermont (Bay Trail) processor.

Intel's old low-power processor, the Atom, has finally received a major update after several years in service. The new design, called Silvermont, is a small low-power processor intended mainly for mobile devices and as a competitor to ARM machines.

The Silvermont still contains traces of the old Atom design, but almost everything has been improved or redesigned. The chip has one or more units with two cores each. The two cores in a unit share the same level-2 cache but they have separate execution resources. Thus, there is no competition for execution resources between threads.

The Silvermont supports the SSE4.2 instruction set, but not AVX and AVX2. It has a throughput of two instructions per clock cycle. There are two execution pipes for integer instructions, two for floating point and vector instructions, and one for memory read and write. Internal buses and execution units are 128 bits wide. Most execution units are pipelined, but some operations stay in the same pipeline stage for two (rarely four) clock cycles for large data sizes or high precision.

The high-end processors from Intel and AMD have powerful capabilities for out-of-order execution, while the old Atom executes all instructions in program order. The Silvermont is a compromise between these two. It has some out-of-order execution, but not much. Integer instructions in general purpose registers can execute out of order with a depth of at most eight instructions. Floating point and vector instructions cannot execute out of order with other instructions in the same one of the two floating point pipes. There is full register renaming.

The cache size is reasonable: 32kB level-1 code, 24kB level-1 data, 1MB level-2. Cache latencies were 3 and 19 clock cycles in my measurements, and the cache performance is generally good.

The whole design seems well proportioned, with reasonable capacities for a low-power chip in all stages of the pipeline - except for one very big bottleneck: the decoders. Simple instructions can decode at a rate of two instructions per clock cycle, but there are quite a lot of instructions that the decoders cannot handle so smoothly. Instructions that generate more than one micro-operation, as well as instructions with certain combinations of prefixes and escape codes, take four, six or even more clock cycles to decode. In many of my test cases I was unable to determine the latency and throughput of the execution units for certain instructions because the decoders were far behind the execution units.

The designers have already removed the common bottleneck of instruction-length decoding by marking instruction boundaries in the code cache (a technique that Intel hasn't used since the Pentium MMX seventeen years earlier). It should be possible to remove the unfortunate bottleneck in the decoders without sacrificing too much power consumption. Let's hope that Intel solves this problem in the next version of the Silvermont, as well as in the forthcoming Knights Landing coprocessor, which is rumored to be based on the Silvermont architecture.

Other news in my manuals includes calling conventions for the forthcoming AVX-512 instruction set, and an update on how to circumvent Intel's CPU dispatcher for Intel compiler version 14.

   
Optimization manuals updated - Silvermont test
Author: Tacit Murky Date: 2014-08-11 07:33
Hi, Agner.
You've mentioned that there's only one L1D cache access port, which can be a serious bottleneck. However, according to your timing table, the "ADD m,r" instruction is fully pipelined, which can only be explained by 2 ports (read+write) working simultaneously. AIDA64's InstLat readout supports this -- see instlatx64.atw.hu for "Bay Trail" or "Avoton" cores. Moreover, unaligned reads and even writes (up to 16 B) are pipelined as well. What are your results for unaligned accesses?

Descriptions of "partial OoO" and other things are here: www.realworldtech.com/silvermont/ . Kanter usually gets such info directly from design team.

   
Optimization manuals updated - Silvermont test
Author: Agner Date: 2014-08-13 05:23
You are right. Thank you.

There is no penalty for unaligned read or write unless a cache line boundary is crossed.

   
Conditional operation
Author: Just_Coder Date: 2014-09-20 17:28
When do you think Intel could add a conditional operation prefix (if ever)? It is rather surprising and stupid that they have not done it yet - it would solve a lot of problems with optimized branching.
   
Conditional operation
Author: Agner Date: 2014-09-21 01:34
Just_Coder wrote:
When do you think Intel could add conditional operation prefix?
I agree that predicated instructions can be useful for avoiding the large branch misprediction penalties. Some instruction sets have this, but so far the x86 family has only conditional move. The forthcoming AVX-512 instruction set has conditional execution of almost all vector instructions, on a per-element basis. See www.agner.org/optimize/blog/read.php?i=288
Microprocessors with AVX-512 are expected some time next year.
   
Conditional operation
Author: Slacker Date: 2014-10-06 16:34
My guess is that CMOV is already taking care of this well enough.

Conditional prefixes would take too much space and bloat the code - hitting bottlenecks in decoders and icache. A few CMOVs at the end of an if-else-endif block take less space than a prefix before each instruction inside.
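For example (my own illustration), a compiler typically turns this into a compare followed by a single CMOV, so a mispredictable condition costs no pipeline flush:

    // Branchless clamp: normally compiles to CMP + CMOVG rather than
    // a conditional jump.
    int clamp_to_limit(int x, int limit) {
        int result = x;
        if (x > limit) result = limit;   // becomes a CMOV, not a branch
        return result;
    }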

Also notice that for some reason Intel cores seem to be limited to 2 inputs per decoded µop. Adding an extra dependency (the condition register) would necessitate adding a µop to most prefixed instructions. Not good.

Speaking of which, this would require keeping the condition in a register at all times. Most x86 integer instructions modify the flags, so you can't keep the condition there. The x86 arch isn't exactly overflowing with registers, to waste 'em like that.

BTW, other CPU architectures are turning away from condition codes recently, for similar reasons. ARM has got rid of them in newer ISA versions (Thumb(-2) and ARMv8), replacing them with an "if-then-else" instruction. Maybe we will someday see a similar instruction in x86.

   
Optimization manuals updated
Author: Slacker Date: 2014-10-06 16:45
Random finding:

It seems the POPCNT instruction has a false dependency on its *output* register on Intel CPUs. At least it does on my Sandy Bridge and on my friend's Haswell. Damn you, Intel!
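The usual workaround is to zero the destination register first; xor-zeroing is recognized by the renamer and breaks the dependency chain. A sketch using GCC inline assembly (my own construction):

    #include <cstdint>

    // Back-to-back POPCNTs no longer serialize on the previous output
    // value once the destination is xor-zeroed first.
    static inline uint64_t popcount64(uint64_t x) {
        uint64_t r;
        asm("xorq %0, %0\n\t"
            "popcntq %1, %0"
            : "=&r"(r) : "r"(x) : "cc");
        return r;
    }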

   
Optimization manuals updated
Author: jenya Date: 2014-10-10 07:49
GCC Bugzilla - Bug 62011 - False Data Dependency in popcnt instruction https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011
   
FP pipelines on Intel's Haswell core
Author: John D. McCalpin Date: 2014-10-17 09:19
Agner wrote:

* There are two execution units for floating point multiplication and for fused multiply-and-add, but only one execution unit for floating point addition. This design appears to be suboptimal since floating point code typically contains more additions than multiplications.

McCalpin's Comments:
0. Definitely agree on the (typical) excess of FP Add over FP multiply operations. The ratios are mostly between 1:1 and 2:1, depending on the application area. I usually assume 1.5:1 in architectural analyses (while keeping in mind that this is a "fuzzy" estimate).
1. Of course one can always expand an FP Add into an FMA to run in the other pipeline. You need a YMM register to hold the dummy "1.0" multiplier values, but in principle it would not be difficult to teach a compiler this trick, along with suitable cost metrics to decide when to employ it. (A sketch follows this list.)
2. Given the 3-cycle latency of FP Add and the 5-cycle latency of both FP Multiply and FP Fused-Multiply-Add, it seems reasonable to speculate that Intel only wanted to add the extra complexity of an "early out" mechanism on one execution port (Port 1). With no need to change the latency, it is trivial to support an isolated Multiply on either Multiply-Add pipeline. Also note that the other FP execution port (Port 0) is already burdened with the logic for FP divides, which is fairly extensive.
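For reference, point 1 in intrinsics form (a sketch; the names are mine):

    #include <immintrin.h>

    // An FP add expressed as a*1.0 + b can be scheduled on either of the
    // two FMA ports, at the cost of FMA latency (5) instead of add latency (3).
    static inline __m256 add_via_fma(__m256 a, __m256 b) {
        const __m256 ones = _mm256_set1_ps(1.0f);   // dummy multiplier
        return _mm256_fmadd_ps(a, ones, b);         // a*1.0 + b == a + b
    }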

On a related note, the latency and throughput numbers for FP divide on various Intel processors suggest that 128-bit FP divide operations perform two 64-bit divides in parallel. (Presumably taking the same number of iterative steps on both values, even if one could have an "early out".) For AVX on Sandy Bridge, Ivy Bridge, and Haswell the reciprocal throughput for the 256-bit FP divide instructions is twice the value for the 128-bit FP divide instructions. This suggests that only one 128-bit "lane" of the FP unit on Port 0 actually supports FP division, and that 256-bit FP operations are performed internally as a sequence of two 128-bit (2-way parallel) FP divide instructions. You show this in the instruction tables as 1 uop on Port 0 for 128-bit FP divide and 2 uops on Port 0 for 256-bit divide, but I had not seen anyone comment specifically on the absence of FP divide throughput speedup on AVX before, so I thought I would bring it up.

Considering this lack of speedup with 256-bit AVX makes one wonder if the 512-bit FP divide instruction in AVX-512 will support higher throughput, or if they will leave the HW implementation where it is and emphasize the SW-pipelined approach (currently used by Xeon Phi, for example). By the time you get to 8-element vectors, the SW approach is almost certainly faster if you don't have to reach full 0.5 ulp precision.
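To make the SW approach concrete, here is a sketch (my own, not taken from Xeon Phi code) of reduced-precision division via the reciprocal approximation plus one Newton-Raphson step:

    #include <immintrin.h>

    // VRCPPS gives roughly 12 bits of precision; one Newton-Raphson
    // iteration roughly doubles that. Fully pipelined, unlike VDIVPS,
    // but the result is not correctly rounded.
    static inline __m256 approx_div(__m256 a, __m256 b) {
        __m256 r = _mm256_rcp_ps(b);                 // r ~= 1/b
        __m256 two = _mm256_set1_ps(2.0f);
        r = _mm256_mul_ps(r, _mm256_sub_ps(two, _mm256_mul_ps(b, r)));  // r' = r*(2 - b*r)
        return _mm256_mul_ps(a, r);
    }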

   
FP pipelines on Intel's Haswell core
Author: Agner Date: 2014-10-18 01:52
John D. McCalpin wrote:
.. wonder if the 512-bit FP divide instruction in AVX-512 will support higher throughput
Historically, the first CPU that supported a new vector size has often split it in two, using a half-size execution unit twice. The AVX-512 instruction set is a huge extension using a lot of silicon. I would not be surprised if the future Intel Skylake processor economizes and splits some 512-bit operations into 2x256 bits. This makes sense, since the software that supports a new instruction set typically lags at least a few years behind.