Optimization manuals updated
Author: Agner | Date: 2013-09-04 11:10
The optimization manuals at www.agner.org/optimize/#manuals have now been updated. The most important additions are:
Some interesting test results for the newly tested processors:
- AMD Piledriver
- AMD Jaguar
- Intel Ivy Bridge
- Intel Haswell
Optimization manuals updated
Author: Agner | Date: 2014-02-19 05:15
The optimization manuals at www.agner.org/optimize/#manuals have now been updated with a test of the AMD Steamroller microprocessor. There are also minor additions regarding the forthcoming AVX-512 instruction set. I have not tested the Intel Silvermont/Bay Trail processor yet because the test machine I have access to cannot run Linux, and the kind of tests I want to do are very difficult to do under Windows.

Test results for AMD Steamroller
Latency of PTEST/VPTEST
Author: | Date: 2014-05-20 03:05
Hi Agner, I noticed that the Intel documentation at https://software.intel.com/sites/landingpage/IntrinsicsGuide/ shows "VPTEST ymm, ymm" as having a latency of 4 cycles on Haswell, up from 2 on Sandy Bridge and Ivy Bridge. It also lists "PTEST xmm, xmm" as having a latency of 2 on all platforms. Your current guide shows a latency of 1 for "PTEST x,x" on Sandy Bridge, and 2 for "PTEST v,v" on Haswell. Are you confident in these measurements, or is it possible that the Intel guide is correct here? Or is this just a terminology difference between PTEST and VPTEST? Thanks!
Latency of PTEST/VPTEST
Author: Agner | Date: 2014-05-20 06:23
It is impossible to measure the latency of an instruction that has one type of register as input (here a YMM register) and another type as output (here the flags). It is only possible to measure the round-trip latency of a series of instructions that ends with the same type of register as it started with. The fact that the upper and lower halves of a 256-bit register may have different latencies makes this even more difficult. I will have to improve the measurement of VPTEST in the next round of tests, but it is probably right that the YMM version has a higher latency than the XMM version.
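For illustration, a round-trip chain of this kind might look like the following sketch (hypothetical NASM code, not the actual test script; the measured time per iteration includes the latencies of SETC and VMOVD, which must be estimated and subtracted):

    .loop:
        vptest  ymm0, ymm0      ; YMM register in, flags out
        setc    al              ; flags -> general purpose register
        vmovd   xmm0, eax       ; general purpose register -> vector register
        dec     ecx
        jnz     .loop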
Optimization manuals updated - Silvermont test
Author: Agner | Date: 2014-08-08 06:02
My manuals have now finally been updated with a test of Intel's Silvermont (Bay Trail) processor. Intel's old low-power processor, the Atom, has got a major update after several years in service. The new design, called Silvermont, is a small low-power processor intended mainly for mobile devices and as a competitor to ARM machines. The Silvermont still contains traces of the old Atom design, but almost everything has been improved or redesigned.

The chip has one or more units with two cores each. The two cores in a unit share the same level-2 cache, but they have separate execution resources, so there is no competition for execution resources between threads.

The Silvermont supports the SSE4.2 instruction set, but not AVX and AVX2. It has a throughput of two instructions per clock cycle. There are two execution pipes for integer instructions, two for floating point and vector instructions, and one for memory reads and writes. Internal buses and execution units are 128 bits wide. Most execution units are pipelined, but some operations stay in the same pipeline stage for two (rarely four) clock cycles in the case of large data sizes or high precision.

The high-end processors from Intel and AMD have powerful capabilities for out-of-order execution, while the old Atom executes all instructions in program order. The Silvermont is a compromise between these two. It has some out-of-order execution, but not much. Integer instructions in general purpose registers can execute out of order with a depth of at most eight instructions. Floating point and vector instructions cannot execute out of order with other instructions in the same one of the two floating point pipelines. There is full register renaming.

The cache size is reasonable: 32 kB level-1 code, 24 kB level-1 data, 1 MB level-2. Cache latencies were 3 and 19 clock cycles in my measurements, and the cache performance is generally good.

The whole design seems well proportioned, with reasonable capacities for a low-power chip in all stages of the pipeline - except for one very big bottleneck: the decoders. Simple instructions can decode at a rate of two instructions per clock cycle, but there are quite a lot of instructions that the decoders cannot handle so smoothly. Instructions that generate more than one micro-operation, as well as instructions with certain combinations of prefixes and escape codes, take four, six or even more clock cycles to decode. In many of my test cases I was unable to determine the latency and throughput of the execution units for certain instructions because the decoders lagged far behind the execution units.

The designers have already removed the common bottleneck of instruction-length decoding by marking instruction boundaries in the code cache (a technique that Intel hasn't used since the Pentium MMX seventeen years earlier). It should be possible to remove the unfortunate bottleneck in the decoders without sacrificing too much power consumption. Let's hope that Intel will have solved this problem in the next version of the Silvermont, as well as in the forthcoming Knights Landing coprocessor, which is rumored to be based on the Silvermont architecture.

Other news in my manuals includes calling conventions for the forthcoming AVX-512 instruction set and an update on how to circumvent Intel's CPU dispatcher for Intel compiler version 14.
Optimization manuals updated - Silvermont test
Author: Tacit Murky | Date: 2014-08-11 07:33
Hi, Agner. You've mentioned that there's only one L1D cache access port, which can be a serious bottleneck. However, according to your timing table, the "ADD m,r" instruction is fully pipelined, which can only be explained by two ports (read + write) working simultaneously. AIDA64's InstLat readout supports this - see instlatx64.atw.hu for the "Bay Trail" or "Avoton" cores. Moreover, unaligned reads and even writes (up to 16 B) are pipelined as well. What are your results for unaligned accesses? Descriptions of the "partial OoO" and other things are here: www.realworldtech.com/silvermont/ . Kanter usually gets such info directly from the design team.
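For example, a throughput test in this spirit might look like the following sketch (hypothetical NASM code, not AIDA64's actual test; the accesses go to different cache lines so the read-modify-write chains are independent, with enough of them to hide the store-forwarding latency between iterations):

    align 16
    .loop:
        add     [rsi], eax          ; if ADD m,r runs at one per clock,
        add     [rsi + 64], eax     ; a read port and a write port must
        add     [rsi + 128], eax    ; be active in the same cycle
        add     [rsi + 192], eax
        add     [rsi + 256], eax
        add     [rsi + 320], eax
        add     [rsi + 384], eax
        add     [rsi + 448], eax
        dec     ecx
        jnz     .loop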
Optimization manuals updated - Silvermont test
Author: Agner | Date: 2014-08-13 05:23
You are right. Thank you. There is no penalty for unaligned reads or writes unless a cache line boundary is crossed.
Conditional operation
Author: Just_Coder | Date: 2014-09-20 17:28
When do you think Intel might add a conditional operation prefix (if ever)? It is rather surprising (and stupid) that they have not done it yet - it would solve a lot of problems with optimized branching.
Conditional operation
Author: Agner | Date: 2014-09-21 01:34
Just_Coder wrote:
> When do you think Intel might add a conditional operation prefix?

I agree that predicated instructions can be useful for avoiding the large branch misprediction penalties. Some instruction sets have this, but so far the x86 family has only conditional moves. The forthcoming AVX-512 instruction set has conditional execution of almost all vector instructions, on a per-element basis. See www.agner.org/optimize/blog/read.php?i=288. Microprocessors with AVX-512 are expected some time next year.
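For example, a per-element conditional add with an AVX-512 mask register might look like this (a sketch assuming the AVX-512F foundation instructions):

    vpcmpgtd    k1, zmm1, zmm2          ; k1 = (zmm1 > zmm2), one mask bit per element
    vpaddd      zmm0{k1}, zmm3, zmm4    ; add only where k1 is set; the other
                                        ; elements of zmm0 are left unchanged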
Conditional operation
Author: Slacker | Date: 2014-10-06 16:34
My guess is that CMOV is already taking care of this well enough. Conditional prefixes would take too much space and bloat the code, hitting bottlenecks in the decoders and icache. A few CMOVs at the end of an if-else-endif block take less space than a prefix before each instruction inside.

Also notice that for some reason Intel cores seem to be limited to 2 inputs per decoded µop. Adding an extra dependency (the condition register) would necessitate adding a µop to most prefixed instructions. Not good. Speaking of which, this would require keeping the condition in a register at all times. Most x86 integer instructions modify the flags, so you can't keep the condition there, and the x86 arch isn't exactly overflowing with registers, to waste 'em like that.

BTW, other CPU architectures have been turning away from general predication recently, for similar reasons. ARM got rid of it in newer ISA versions (Thumb-2 and ARMv8), replacing it with an "if-then-else" instruction. Maybe we will someday see a similar instruction in x86.
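For illustration, a hypothetical if-else such as r = (a > b) ? x : y becomes branchless with a single CMOV at the end:

    cmp     eax, ebx        ; compute the condition once (a in eax, b in ebx)
    mov     ecx, edx        ; r = y
    cmovg   ecx, esi        ; r = x if a > b (signed compare)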
Optimization manuals updated
Author: Slacker | Date: 2014-10-06 16:45
Random finding: It seems the POPCNT instruction has a false dependency on its *output* register on Intel CPUs. At least it does on my Sandy Bridge and on my friend's Haswell. Damn you, Intel!
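For reference, the usual workaround is a dependency-breaking zeroing idiom before each POPCNT (a sketch):

    xor     eax, eax        ; recognized at register renaming as independent
                            ; of the old eax, so it breaks the false chain
    popcnt  eax, ebx        ; no longer waits for the previous eax value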
Optimization manuals updated
Author: jenya | Date: 2014-10-10 07:49
GCC Bugzilla, Bug 62011: False Data Dependency in popcnt instruction. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011
FP pipelines on Intel's Haswell core
Author: John D. McCalpin | Date: 2014-10-17 09:19
Agner wrote:
> There are two execution units for floating point multiplication and for fused multiply-and-add, but only one execution unit for floating point addition. This design appears to be suboptimal since floating point code typically contains more additions than multiplications.

McCalpin's Comments: On a related note, the latency and throughput numbers for FP divide on various Intel processors suggest that 128-bit FP divide operations perform two 64-bit divides in parallel. (Presumably taking the same number of iterative steps on both values, even if one could have an "early out".)

For AVX on Sandy Bridge, Ivy Bridge, and Haswell, the reciprocal throughput for the 256-bit FP divide instructions is twice the value for the 128-bit FP divide instructions. This suggests that only one 128-bit "lane" of the FP unit on Port 0 actually supports FP division, and that 256-bit FP divides are performed internally as a sequence of two 128-bit (2-way parallel) FP divide operations. You show this in the instruction tables as 1 uop on Port 0 for 128-bit FP divide and 2 uops on Port 0 for 256-bit divide, but I had not seen anyone comment specifically on the absence of FP divide throughput speedup in AVX before, so I thought I would bring it up.

Considering this lack of speedup with 256-bit AVX makes one wonder whether the 512-bit FP divide instruction in AVX-512 will support higher throughput, or whether Intel will leave the HW implementation where it is and emphasize the SW-pipelined approach (currently used by Xeon Phi, for example). By the time you get to 8-element vectors, the SW approach is almost certainly faster if you don't have to reach full 0.5 ulp precision.
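The SW-pipelined approach referred to here is typically a reciprocal estimate refined by a Newton-Raphson iteration, along these lines (a hypothetical AVX sketch computing a/b, nowhere near full 0.5 ulp precision):

    vrcpps  ymm1, ymm2          ; x0 ~ 1/b, roughly 12 bits of precision
    vmulps  ymm3, ymm2, ymm1    ; b*x0
    vsubps  ymm3, ymm4, ymm3    ; 2 - b*x0 (ymm4 holds 2.0f in all elements)
    vmulps  ymm1, ymm1, ymm3    ; x1 = x0*(2 - b*x0), roughly 23 bits
    vmulps  ymm0, ymm0, ymm1    ; a/b ~ a*x1 (a in ymm0)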
FP pipelines on Intel's Haswell core
Author: Agner | Date: 2014-10-18 01:52
John D. McCalpin wrote:
> ... wonder whether the 512-bit FP divide instruction in AVX-512 will support higher throughput

Historically, the first CPU that supports a new vector size has often split it in two, using a half-size execution unit twice. The AVX-512 instruction set is a huge extension using a lot of silicon. I would not be surprised if the future Intel Skylake processor economizes and splits some 512-bit operations into 2 x 256 bits. This makes sense, since the software that supports a new instruction set typically lags at least a few years behind.
FP pipelines on Intel's Haswell core
Author: Jorcy de Oliveira Neto | Date: 2015-09-24 20:54
John D. McCalpin wrote:
> [...] Given the 3-cycle latency of FP Add and the 5-cycle latency of both FP Multiply and FP Fused-Multiply-Add, it seems reasonable to speculate that Intel only wanted to add the extra complexity of an "early out" mechanism on one execution port (Port 1). With no need to change the latency, it is trivial to support an isolated Multiply on either Multiply-Add pipeline. Also note that the other FP execution port (Port 0) is already burdened with the logic for FP divides, which is fairly extensive. [...]

I guess Intel has also decided to move the integer SIMD multiplier from Port 1 back to Port 0 to alleviate some of the SIMD stress on Port 1, as it still keeps the sole "pure" FP-add execution unit, even at the cost of aggregating more complexity (and more integer stress/workload) back on Port 0. Interestingly, keeping the integer SIMD multiplier on the same port as the divider is something that hasn't been seen in the P6 family since the Pentium III/Pentium M days.
FP pipelines on Intel's Haswell core
Author: Agner | Date: 2015-09-25 00:45
Jorcy de Oliveira Neto wrote:
> I guess Intel has also decided to move the integer SIMD multiplier from Port 1 back to Port 0 to alleviate some of the SIMD stress on Port 1.

They always put instructions with the same latency on the same port, even if that port is overused.
Micro-fusion limited to 1-reg addressing modes
Author: Peter Cordes | Date: 2015-07-11 21:39
uop micro-fusion on Intel SnB seems to be possible only when it doesn't create uops with more than 2 input dependencies. Intel's code analyzer (IACA, from https://software.intel.com/en-us/articles/intel-architecture-code-analyzer) knows about this, and real experiments on Sandy Bridge hardware confirm that it's real: see my answer at stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes. I didn't see any mention of this in your optimization manual or microarchitecture docs.

I tested again with store instructions, as that's an example used in your microarch doc, and it seems they can only fuse when one-register addressing modes are used. For example:

mov [rsi + 0 + rdi], eax    ; produces as many fused-domain as unfused-domain uops
mov [rsi + 0], eax          ; produces 1 fused-domain uop, 2 unfused-domain uops

I'm doing all my testing on 64-bit Linux, on an i5 2500k (SnB). I just tested a 32-bit binary, since your example did use 32-bit registers. Same result: mov [esi+edi], eax can't micro-fuse. Assembled/linked with: [...]

In the Core2/Nehalem section of your architecture guide, you say: [...] I think this is wrong. I haven't tested Core2 or Nehalem, just SnB, but the SnB/IvB section simply refers back to the Nehalem section without mentioning any caveats. I'm sure it's wrong for SnB. IACA with -arch NHM doesn't show micro-fusion for 2-reg addresses for stores or ALU ops, so this needs testing on Nehalem hardware, too. (IACA can't analyse pre-Nehalem arches.)

That's about the only bad thing I can say about your work, though! Overall, it's an amazing resource. Making each section stand alone would bloat things, and make it less obvious when things were the same for multiple CPUs, so that wouldn't be good either.
Micro-fusion limited to 1-reg addressing modes
Author: Agner | Date: 2015-07-12 00:42
Peter Cordes wrote:
> uop micro-fusion on Intel SnB seems to be possible only when it doesn't create uops with more than 2 input dependencies.

Thanks for the info and the link. I haven't tested uop fusion on Intel processors since the uop cache was introduced, so it may be that this is a drawback of the uop cache. I don't have the time to test it right now, but I will add a test for uop fusion next time I update my test scripts. Have you tried with FMA instructions?
Micro-fusion limited to 1-reg addressing modes
Author: Tacit Murky | Date: 2015-11-15 12:23
This is a limitation of the IDQ, so it happens after either the decoders or the uop cache. Unlike Nehalem, SnB introduced a new rule for the IDQ: micro-fused uops with an index register (and some other rare types) are un-fused when written to the IDQ. Even if you have 4 of them per clock, the IDQ would get all 8 "uop halves" at once, but (as with any uops) it can read only 4 per clock. The reason: the IDQ, ROB and RS buffers and queues use a reduced uop format (for energy efficiency) without a 6-bit (renamed) index field. There are no changes to this in IvB, but there may be in Haswell.
Micro-fusion limited to 1-reg addressing modes
Author: Agner | Date: 2015-12-01 09:07
Peter Cordes wrote:
> uop micro-fusion on Intel SnB seems to be possible only when it doesn't create uops with more than 2 input dependencies.

I have now tested this on Sandy Bridge, Ivy Bridge, Haswell and Broadwell. I have not had access to a Skylake yet. The results show that instructions with three input dependencies fuse all right and use only a single entry in the micro-operation cache, unless they contain more than 32 bits of address and immediate data. Instructions with more than 32 bits of data still fuse, but use two entries in the micro-operation cache. It is possible to make instructions with four input dependencies on Haswell and Broadwell, using the fused multiply-and-add instructions. These also fuse and use only a single entry in the micro-operation cache. Instructions with both rip-relative addressing and immediate data do not fuse.
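For example, an FMA with an indexed memory operand has four inputs: the accumulator (which is also the destination), the second multiplicand, and the base and index registers of the address (a hypothetical instance):

    vfmadd231ps ymm0, ymm1, [rsi + rdi]     ; ymm0 += ymm1 * [rsi+rdi]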
Micro-fusion limited to 1-reg addressing modes
Author: Peter Cordes | Date: 2015-12-15 20:25
Agner wrote:
> Peter Cordes wrote: [...]

I guess you didn't see my response on Stack Overflow to your 2nd answer there. Our test results disagree. I see a change in the uop perf counters, and an increase in the clock cycles taken, when changing from "or eax, [rsi]" to "or eax, [rsi+rdi]". I didn't try to measure uop-cache slots, just a total cycle count and fused/unfused uop counts. My full test code, and the Linux perf command I used to get data from the performance counters, is posted on Stack Overflow.

Based on Tacit Murky's information that SnB's internal uop format doesn't have room for a micro-fused index register, maybe 2-register addressing modes can still micro-fuse in the uop cache, but not in the pipeline where the ROB tracks them. Did your test results make an assumption about uops in the uop cache being the same as fused-domain uops in the pipeline?

I re-ran my test after seeing your response, and I'm still sure I'm seeing 2-register source operands NOT micro-fusing. If I'm wrong, can you please have a look and help me figure out what's wrong with my test procedure? I've been using Linux perf. I'm essentially testing fused-domain uops against the 4-wide limit of the pipeline for issuing/retiring 4 fused-domain uops per clock. Some of my fused-domain uops are NOPs, to avoid unfused-domain execution-port bottlenecks on SnB.
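A test loop in that spirit might look like the following sketch (hypothetical NASM code, not the actual test; it assumes that DEC/JNZ macro-fuse into one uop and that NOPs consume no execution port):

    align 32
    .loop:
        mov     [rsi + rdi], eax    ; indexed store: 1 fused-domain uop if it
                                    ; micro-fuses, 2 if it does not
        nop                         ; filler uops with no execution port
        nop
        dec     ecx                 ; DEC/JNZ macro-fuse into one uop
        jnz     .loop
        ; 4 fused-domain uops per iteration -> minimum 1.00 clocks,
        ; 5 fused-domain uops per iteration -> minimum 1.25 clocks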
Micro-fusion limited to 1-reg addressing modes
Author: Peter Cordes | Date: 2016-05-24 07:15
Agner wrote:
> The results show that instructions with three input dependencies fuse all right and use only a single entry in the micro-operation cache

I found official confirmation in Intel's optimization manual that we're both right (see Section 2.3.2.4: "Micro-op Queue and the Loop Stream Detector (LSD)"); we were just measuring different things. SnB-family still micro-fuses such instructions in the decoders and uop cache, but "un-laminates" uops with an indexed addressing mode before the issue/rename stage. The uop format used in the ROB must be different from the format in the uop cache. The unfused-domain scheduler (RS) must still handle uops with indexed addressing modes, because pure loads with complex addressing modes are still a single p23 uop. Tacit Murky's earlier post says that the un-lamination happens as uops are written to the IDQ, so the loop-buffer size is measured in un-laminated uops.

For the purposes of pipeline width and tight loops, indexed addressing modes don't micro-fuse. The 4-wide issue width is after un-lamination.

For the record, un-laminate is not a normal English word, but delaminate is. I want to put quotes around it every time I type it. >.<

BTW, I updated my answer on StackOverflow with this info.
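To illustrate the distinction with a hypothetical pair of instructions using the same indexed address:

    mov     eax, [rsi + rdi]    ; pure load: a single uop everywhere in the pipeline
    or      eax, [rsi + rdi]    ; one fused uop in the decoders and uop cache,
                                ; un-laminated to two uops at the issue/rename stage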
Skylake?
Author: Travis | Date: 2015-10-21 17:55
Any plans to update the manual with Skylake information? So far the concrete information on the Haswell/Broadwell -> Skylake differences has been pretty thin, so it would be awesome to see an update of what many of us consider the canonical source for this info.
Skylake?
Author: Agner | Date: 2015-10-22 00:36
Travis wrote:
> Any plans to update the manual with Skylake information?

I am waiting for the alleged "Xeon" version of Skylake with the new AVX-512 instruction set. Does anybody have rumors about this? If you have a Skylake, then I can test it by remote access if you will allow me. I am also missing the Broadwell. (My remote testing requires Linux.)
Skylake?
Author: | Date: 2015-10-22 13:07
The Skylake Xeons that have been announced use the "client" Skylake core (and uncore) and do not support AVX-512. This is similar to the previous mixing of "client" and "server" uncores across Core i7 and Xeon E3 platforms, but of course this is the first time that it impacts the ISA. The last 3 major core releases have had ~18 months between the release of the first "client" part and the release of the 2-socket server part, so if that schedule is repeated we should see the Skylake Xeon EP in 1H2017. Earlier seems unlikely, since we have not seen the "Broadwell EP" yet. Later is certainly possible, since this is the first time that there has been a major ISA-level difference between the "client" and "server" core.
Skylake?
Author: | Date: 2015-10-23 05:28
During this year there have been various leaks on many sites about the roadmaps for Intel Xeon, and all of them indicated that Skylake Xeon with AVX-512 (the "Purley" platform) will be introduced some time in 2017, exactly as you have assumed. Besides AVX-512, it is known that those Skylake Xeon processors will have a large number of other improvements, e.g. a new socket with 6 memory channels, possibly the same socket that will be used by Knights Landing. Knights Landing, the first processor with AVX-512, is rumored to have been delayed until Q3 2016, so it will probably be introduced about half a year earlier than Skylake with AVX-512.

In conclusion, I hope that Agner will not wait to test Skylake until AVX-512 becomes available, because that will not happen any time soon. Judging from the Intel optimization manual, Skylake does not seem to have any dramatic differences compared to Broadwell & Haswell, unlike the jump between Ivy Bridge & Haswell, or between Nehalem & Sandy Bridge. Nevertheless, detailed testing might uncover interesting features.

The Xeon E3-1535M v5 will become available in the new models of mobile workstations announced by Lenovo and Dell. I will buy one ASAP, but it seems that those mobile workstations will be offered only in December, or maybe even only in January, where I live. I will use Linux on it, and if by that time Skylake has not been tested yet, I could easily offer remote access to it. Nevertheless, there are still 3 months until then, so maybe another kind soul will provide access to a Skylake earlier.
Skylake?
Author: | Date: 2015-10-23 17:59
The currently announced Skylake Xeon CPUs are the E3 parts, which are virtually the same as the client i3-i7 4C+GPU parts on LGA1151 (AFAIK). The *real* Xeon chips, the E5 parts (probably Skylake-E), will come later, hopefully with AVX-512. (I hope there will be Core i7 CPUs based on Skylake-E, with 6-8 cores, as happened with Haswell-E.) Since there is no real competition for Intel in HEDT, we have to wait for the server parts for a major throughput increase.
Skylake?
Author: Slacker | Date: 2015-10-24 02:11
You're also missing Excavator and Puma cores on the AMD side. I hope you will get a chance to test those soon!
Excavator and Puma
Author: Agner | Date: 2015-12-16 01:01
Slacker wrote:
> You're also missing Excavator and Puma cores on the AMD side. I hope you will get a chance to test those soon!

I can't find a motherboard for these anywhere. I am not going to buy a whole computer just to make a few tests.
Excavator and Puma
Author: Slacker | Date: 2016-01-03 22:02
That's going to be a problem. AMD explicitly skipped desktop versions of both, preferring to keep older cores serving that segment. Unless some manufacturer decides to make a mobo with a (probably overpriced) mobile APU baked in - which is not bloody likely - such a mobo is not going to exist. Right now, the cheapest way to get your hands on a Puma seems to be this barebone. Still kinda pricey, though.

Then again, maybe you could buy a whole computer just to make a few tests - and then sell it off? Carrizo laptops have been on the market long enough for some used units to be occasionally floating around on fleaBay. Should you buy one, you should be able to resell it at minimal (if any) loss.
Excavator and Puma
Author: | Date: 2016-01-16 07:43
You should wait for Bristol Ridge and test it on an all-new AM4 board with DDR4, or better yet, pick up an FM2+ board with a proper BIOS update and put in an Athlon X4 845 (which is Carrizo with the GPU disabled). cdn.wccftech.com/wp-content/uploads/2016/01/AMD-Carrizo-Desktop-APUs.jpg
Excavator and Puma
Author: | Date: 2016-02-02 13:11
Good news: AMD just released the Athlon X4 845, which uses Excavator cores, for the FM2+ socket. www.anandtech.com/show/10009/amd-launches-excavator-on-desktop-the-65w-athlon-x4-845-for-70