Software optimization resources | E-mail subscription to this blog | www.agner.org
Test results for Intel's Sandy Bridge processor
Author: Agner | Date: 2011-01-30 11:15
I have now got an opportunity to test the new Sandy Bridge processor from Intel, and the results are very interesting. There are many improvements - and few drawbacks. I have updated my manuals with the details, but let me just summarize the main findings here:
My conclusion is that the Sandy Bridge processor has many significant improvements over previous processors. The most serious bottlenecks and weaknesses of previous processors have been removed. The micro-op cache turns out to be an important improvement for relatively small loops. Unfortunately, the poor performance of the decoders has not been improved. This remains a likely bottleneck for code that doesn't fit into the micro-op cache.

The decoding of instruction lengths has been a problem in Intel processors for many years. They tried to fix the problem with the trace cache in the Pentium 4, which turned out to be a dead end, and now with the apparently more successful micro-op cache in the Sandy Bridge. AMD have solved the problem of detecting instruction lengths in their processors by marking instruction boundaries in the code cache. Intel did the same in the Pentium MMX back in 1996, and it is a mystery to me why they are not using this solution today. There would hardly be a need for the micro-op cache if they had instruction boundaries marked in the code cache.

Whenever the narrowest bottleneck of a system is removed, the next-narrowest bottleneck becomes visible. This is also the case here. As the memory read bandwidth is doubled, the risk of cache bank conflicts is increased. Cache bank conflicts were actually the limiting factor in some of my tests.

It has struck me that the new Sandy Bridge design is actually under-hyped. I would expect a new processor design with so many improvements to be advertised aggressively, but the new design doesn't even have an official brand name. The name Sandy Bridge is only an unofficial code name. In Intel documents it is variously referred to as "second generation Intel Core processors", "2xxx series", and "Intel microarchitecture code name Sandy Bridge". I have never understood what happens in Intel's marketing department. They keep changing their nomenclature, and they use the same brand names for radically different technical designs. In this case they have no reason to obscure technical differences. How can they cash in on the good reputation of the Sandy Bridge design when it doesn't even have an official name? [Corrected on June 08, 2011, and Mar 2, 2012]
Test results for Intel's Sandy Bridge processor
Author: | Date: 2011-02-15 11:09
Hi Agner - have you noticed that Turbo Boost is much more effective on Sandy Bridge than on Nehalem? On a 4-core 3.4 GHz SB with all cores working flat out I'm seeing the clock speed staying at 4.3 GHz for 20+ minutes. This seems to suggest that there is 25% extra performance to be had for free, so long as you have sufficient cooling.
AVX2
Author: phis | Date: 2011-06-23 01:13
Thanks for your detailed analysis, this is very useful indeed. Have you seen the updated Intel Advanced Vector Extensions Programming Reference (June 2011)? There are interesting things in there, including AVX2 (256-bit integer AVX instructions) and some VEX-encoded general-purpose instructions for bit manipulation et al.
AVX2
Author: Agner | Date: 2011-06-23 11:35
Thanks for the reference. I always expected that there would be an AVX2 with 256-bit integer vector instructions. The most surprising extension is the VGATHER instructions that allow vectorized table lookup. Lookup tables have always been an obstacle to vectorization. I wonder how efficient it will be, though. The performance will still be limited by the number of address-generation units and read ports in the CPU. The physical random number generator instruction (RDRAND) has been announced previously. It is strongly needed for cryptographic and security applications. The VIA processors have had such an instruction for years now. I will update my "objconv" disassembler with the new instructions when I get the time.
Test results for Intel's Sandy Bridge processor
Author: anon | Date: 2013-08-01 06:26
Thank you very much for the good analysis. There is one restriction that isn't mentioned in your document. In Sandy Bridge and later processors, instructions to which macro-op fusion can be applied (add, sub, and, cmp, test, inc, dec) seem to be decoded only by the simple decoders (3 of the 4). This restriction does not exist in Nehalem or earlier processors. Since there is a decoded uop cache and the OoO backend executes these instructions with a throughput of 3 per cycle, it should have little impact on real-world performance. But it might be a different story on Haswell, which has more execution ports.
Test results for Intel's Sandy Bridge processor
Author: Agner | Date: 2013-08-06 01:47
How do you know, Anon? Please don't post unverified claims anonymously. |
Test results for Intel's Sandy Bridge processor
Author: anon | Date: 2013-08-07 07:19
500 iterations of this code sequence (4,000 instructions in total, so it does not fit into the uop cache):

or rax, 1
(eight such instructions per sequence)

run at 2 clocks / 8 instructions (as expected). But if we change 6 of the ORs into ANDs (or other macro-fuseable instructions), it drops to 2.5 clocks / 8 instructions. This means that the decoder cannot handle four macro-fuseable instructions in the same clock cycle.
Test results for Intel's Sandy Bridge processor
Author: Agner | Date: 2013-08-07 11:11
anon wrote:
But if we change 6 ORs into AND (or other macro-fusable instructions), it drops to 2.5 clocks / 8 instructions. It means that the decoder cannot handle four macro-fusable instructions at the same clock cycle.

I get 2.45 clocks on an Ivy Bridge. I get the same for NOT and NEG, which are not fusable. There is nothing the instructions can actually fuse with, though.
Test results for Intel's Sandy Bridge processor
Author: anon | Date: 2013-08-07 11:49
Agner wrote:
I get the same for NOT and NEG, which are not fusable.

Repeating 6 not/neg instructions (2 or 3 bytes x 6) will be affected by the predecoder's limitation. To avoid that, this code sequence is helpful:

not rax

This runs at 2 clocks / 8 insts. But

and rax, rax

doesn't.
Test results for Intel's Sandy Bridge processor
Author: Agner | Date: 2013-08-08 01:38
anon wrote:
Repeating 6 not/neg (2 or 3 bytes x 6) will be affected by predecoder's limitation.

Is there a limitation on decoding short instructions? Is this documented anywhere? I have observed on Haswell that conditional move instructions, which generate 2 micro-ops, decode at two per clock only when I add prefixes to make the instructions 4 bytes long. This applies also when the micro-op cache is used.
Test results for Intel's Sandy Bridge processor
Author: anon | Date: 2013-08-08 04:56
Agner wrote:
Is there a limitation on decoding short instructions? Is this documented anywhere?

I'm not sure if it really is the predecoder's limitation. For example, with a repeated 2-byte "or r32, r32" padded to different instruction-length patterns, the clocks per 4 instructions are:

pattern   $miss   $hit
2+2+2+2   1.0     1.0
3+2+2+2   1.13    1.13
3+3+2+2   1.25    1.19
3+3+3+2   1.31    1.0
3+3+3+3   1.21    1.15
4+3+3+3   1.16    1.0
4+4+3+3   1.0     1.10
4+4+4+3   1.0     1.16
4+4+4+4   1.0     1.0

($miss = uop cache miss, $hit = uop cache hit.) So it seems there are some limitations regarding instruction count in a 16B (or larger) code block, for both the legacy decoder and the uop cache.
Test results for Intel's Sandy Bridge processor
Author: Agner | Date: 2013-08-09 01:53
This looks like an alignment issue. The code is fetched in 16-byte blocks. Instructions that cross a 16-byte boundary (or a 32-byte boundary?) are decoded less efficiently. The µop cache is coupled to the instruction cache with a maximum of three 6-µop entries per 32-byte block of code. How this translates to inefficiency when instructions with certain lengths execute out of the µop cache, I don't really understand.
I have done some experiments to test your claim that fuseable instructions decode less efficiently:

xchg r8,r9  ; 3 µops. Decodes alone
or eax,eax  ; 1 µop, D0
or ebx,ebx  ; 1 µop, D1
or ecx,ecx  ; 1 µop, D2
or edx,edx  ; 1 µop, D3

This decodes in 2 clocks. If the last OR is changed to an AND, it decodes in 3 clocks. It will not put a fuseable arithmetic/logic instruction into decoder D3 because then it can't check in the same clock cycle whether the next instruction is a branch. There is no effect when this executes out of the µop cache.
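To make the hypothesis concrete, here is a toy Python model of the decode stage. The 4-slot structure, the "multi-µop instruction decodes alone" behavior, and the "no fuseable instruction in D3" rule are taken from this exchange; everything else is my simplification, not Intel documentation.

```python
# Toy model of a 4-wide decode stage with the restriction hypothesized
# in this thread: macro-fuseable ALU instructions never occupy the last
# decoder slot D3 (so the decoder can still pair them with a following
# branch), and multi-uop instructions decode alone in the complex decoder.

FUSEABLE = {"add", "sub", "and", "cmp", "test", "inc", "dec"}

def decode_clocks(instrs):
    """Clocks to decode a list of (mnemonic, uop_count) pairs."""
    clocks, i, n = 0, 0, len(instrs)
    while i < n:
        clocks += 1
        slot = 0
        while i < n and slot < 4:
            mnem, uops = instrs[i]
            if uops > 1:
                if slot == 0:
                    i += 1          # complex instruction decodes alone
                break
            if mnem in FUSEABLE and slot == 3:
                break               # fuseable insts are kept out of D3
            slot += 1
            i += 1
    return clocks

seq_or  = [("xchg", 3)] + [("or", 1)] * 4                 # Agner's sequence
seq_and = [("xchg", 3)] + [("or", 1)] * 3 + [("and", 1)]  # last OR -> AND
print(decode_clocks(seq_or))    # 2, as measured
print(decode_clocks(seq_and))   # 3: the AND cannot go into D3
```

Under this model the AND at the end pushes decoding into an extra cycle, matching the 2-clock vs 3-clock measurements above.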
Test results for Intel's Sandy Bridge processor
Author: anon | Date: 2013-08-09 04:50
Interesting. So it sounds like the odd rule also exists in the uop cache territory?
Here is another example:
or rax, 1
or rdx, 1
or rsi, 1
movaps xmm0, [r10]
or rdi, 1
or r8, 1
movaps xmm1, [r11]
or r9, 1

This runs at 2 clocks / 8 instructions regardless of uop cache hit/miss. But if all the ORs are changed into ANDs, it drops to 2.45 clocks / 8 instructions when the code doesn't fit into the uop cache. Of course,

and rax, 1
and rdx, 1
and rsi, 1
movaps xmm0, [r10]
and rdi, 1
and r8, 1
and r9, 1
movaps xmm1, [r11]

runs at 2 clocks / 8 instructions without problems. This means not only that the decode throughput of AND instructions is limited to 3 per cycle, but also that the 4-1-1-1 pattern rule applies to them. This makes me believe that macro-fuseable instructions are only handled by the simple decoders.
Test results for Intel's Sandy Bridge processor
Author: Agner | Date: 2013-08-10 01:44
There are at least two different issues here. One is, as you suggested, that the fuseable instructions don't go into the last decoder. The other is that short instructions don't go into the µop cache if they generate a total of more than 18 µops per 32 bytes of code. Maybe there is also an alignment issue. We will have to do some more experiments to test this. You can easily make instructions longer (up to 15 bytes) by adding dummy segment prefixes (db 3EH).
Test results for Intel's Sandy Bridge processor
Author: Agner | Date: 2013-08-10 05:52
Now I have done some tests of the alignment effects. This explains the weird results I have seen earlier where the performance was improved when some instructions were made longer.

mov ebp, 100
align 32
LL:
%rep 100        ; uops  bytes
cmove eax,eax   ;  2      3
cmove ebx,ebx   ;  2      3
xchg r8,r9      ;  3      3
nop7            ;  1      7
nop7            ;  1      7
nop8            ;  1      8
nop             ;  1      1
                ; Total: 11  32
%endrep
dec ebp
jnz LL

This takes almost 4 clocks. When I add a nop after align 32 to change the alignment by one byte, it takes only 3 clocks. The explanation is this: Each µop cache line can take 6 µops. The first two instructions take one µop cache line. The xchg instruction cannot cross a cache line, so it starts in a new cache line. The next three instructions go in the same line, and the last nop takes a third line. Then there is a 32-byte boundary and we start a new cache line. In total we need 300 cache lines, but there are only 256 lines in the µop cache. The loop doesn't fit into the µop cache, so the decoders become the bottleneck. When the alignment is changed, the last nop goes together with the two cmove instructions in the next iteration, and we need only 200 cache lines. Now it fits into the µop cache and the speed goes up. The same can be obtained by lowering the repeat count.
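The line counting in this explanation can be checked with a few lines of Python. This is only arithmetic on the figures stated above: 3 µop-cache lines per 32-byte window when aligned, 2 when shifted, against a 256-line capacity.

```python
# Back-of-the-envelope check of the µop-cache accounting described
# above: the cache holds 256 lines of up to 6 µops each, and each
# 32-byte code window may map to at most 3 lines.

UOP_CACHE_LINES = 256

def fits_in_uop_cache(lines_per_window, windows):
    """True if the loop's µop-cache lines fit in the cache."""
    return lines_per_window * windows <= UOP_CACHE_LINES

# Aligned loop: 100 x 32-byte windows, 3 µop-cache lines each
print(fits_in_uop_cache(3, 100))   # False: 300 > 256, decoders become the bottleneck
# Shifted by one byte: the trailing nop packs into the next window's line
print(fits_in_uop_cache(2, 100))   # True: 200 <= 256, runs from the µop cache
```

This reproduces the observed switch between decoder-bound (almost 4 clocks) and µop-cache-bound (3 clocks) behavior.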
Test results for Intel's Sandy Bridge processor
Author: | Date: 2013-10-09 13:14
First -- thank you very much for your performance work -- this is by far the most comprehensive and accurate compilation of microarchitecture and performance data that I have been able to find since I left the AMD processor design team at the end of 2008, and it has been very helpful in my ongoing studies of core performance issues. (Most of my prior work has been on memory systems and coherence protocols -- e.g., www.cs.virginia.edu/stream/ -- but now I am trying to learn more about core microarchitecture, performance, and power.)

This note concerns the L1 Data Cache banking on Intel's Sandy Bridge (and presumably Ivy Bridge) processors. Intel's Performance Optimization reference manual (document 248966-028, July 2013) says that Sandy Bridge cores will have an L1 Data Cache bank conflict if two loads ready for issue in the same cycle to two different cache lines in the same cache set have matching bits 5:2. This seems odd, since 5:2 is four bits and they are clear in reporting that there are only 8 banks. In the forum posts, the Intel employees were clearly not being permitted to disclose the full details, so my curiosity was aroused.

The example code that they provide in section 3.6.1.3 (example 3-37) attempts to load two 32-bit items from the same offset within two different cache lines mapping to the same cache set. This does demonstrate bank conflicts, but not very many. (The loads can dual-issue after the first cycle -- so the code takes 5 cycles to perform the 8 loads instead of 4 cycles.) Repeating the loop a million times and using performance counter event BFh, umask 05h (L1D_BLOCKS.BANK_CONFLICT_CYCLES) confirmed the stalls. Unfortunately the "corrected" version that they provide does not demonstrate that a difference in bits 5:2 will avoid a bank conflict. So I built code similar to their example, except that all 8 loads were to the same offset of 8 different cache lines that mapped to the same cache set.
This gave a measured bank conflict rate close to my estimate of 7/8 (since there is no stall counted for the first of the 8 loads and the conflict continues for all loads after the first.) Then I modified the offsets so that the 8 loads were to consecutive 32-bit locations in 8 different cache lines that mapped to the same set. I.e., a stride of 17 32-bit words instead of 16 32-bit words. This gave zero conflicts and directly confirms that a difference in address bit 2 is enough to prevent a bank conflict (at least for 32-bit loads). That is quite an interesting result because it does not fit easily into the model of a cache having 64-bit wide or 128-bit wide banks (as you suggest in section 9.13 of your microarchitecture reference guide). My current hypothesis is that the cache has 8 banks that are each 32 bits wide, but run at twice the processor core frequency -- giving an effective width of 64 bits, but a granularity of access of 32 bits -- almost the same as having 16 banks. The main idea is that each bank can accept two addresses per cycle and deliver two 32-bit results from different lines, but with the critical limitation that it can only deliver the low-order 32-bits in the first half-cycle and can only deliver the high-order 32-bits in the second half-cycle. This combination of features is the only mechanism I could think of that retains the bank conflict seen when bits 5:2 match but which allows dual issue when bits 5:2 differ. Technologically, a double-speed cache appears possible -- experiments with the CACTI cache simulator (http://www.hpl.hp.com/research/cacti/) suggest that a 32 KiB cache of similar configuration should be able to run at up to about 7.5 GHz in a 32nm process technology, with an area similar to what I estimate from the Sandy Bridge die photos. I have reviewed many of the other possible combinations of alignment for a pair of loads and my hypothesis appears to provide a plausible explanation of the observed behavior. 
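The documented bits-5:2 rule and these two experiments can be modeled in a few lines of Python. This is a sketch under my own simplifying assumption that loads pair up in program order; it is not a claim about Intel's actual issue logic, and the address patterns merely mirror the experiments described above.

```python
# Model of the documented conflict rule: two loads issued in the same
# cycle to *different* cache lines conflict when address bits 5:2 match.

def bits_5_2(addr):
    return (addr >> 2) & 0xF

def conflicting_pairs(addresses):
    """Count in-order pairs of loads that trip the bits-5:2 rule."""
    count = 0
    for a, b in zip(addresses[::2], addresses[1::2]):
        different_line = (a >> 6) != (b >> 6)
        if different_line and bits_5_2(a) == bits_5_2(b):
            count += 1
    return count

# 8 loads at the same offset of 8 lines in one set (4 KiB set stride)
same_offset = [i * 4096 for i in range(8)]
# Same 8 lines, but each load one 32-bit word further along
shifted = [i * 4096 + i * 4 for i in range(8)]

print(conflicting_pairs(same_offset))   # 4: every pair matches in bits 5:2
print(conflicting_pairs(shifted))       # 0: bit 2 differs within each pair
```

The model reproduces the two measured outcomes: heavy conflicts for same-offset loads, none once consecutive loads differ in bit 2.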
There are some problematic cases with combinations that include a 128-bit load on a 32-bit boundary where my model suggests a bank conflict even when bits 5:2 differ, but I am not sure that Intel's claims about the ability to dual-issue are intended to cover all such misalignments (and I have not coded any of these cases to see if they actually generate bank conflicts). This is part of work that I am doing to develop a set of microbenchmarks that can be used to document the behavior of hardware performance counters so that I can have some hope of using them to understand application characteristics. I have not had time to review your latency and throughput test codes yet, but I hope that with some modification (mostly controlling where the data is located when data motion instructions are executed) they will be useful in illuminating the specifics of what the performance counters are actually counting.... |
Test results for Intel's Sandy Bridge processor
Author: Agner | Date: 2013-10-10 10:34
Thank you for your comments, John. I think it is unlikely that the cache could be running at double clock frequency. It is too big for that. Some previous models have run the cache at half clock frequency. Maybe your observations have something to do with the fact that the Sandy Bridge has two read ports? Have you tried on Haswell? It should have no cache bank conflicts.
Test results for Intel's Sandy Bridge processor
Author: | Date: 2013-10-11 19:11
I am not an SRAM expert, but my experiments with CACTI suggest that a double-speed 32 KiB cache is possible in a 32 nm process. It is certainly possible that I am misunderstanding the results. It seems to me that the available documentation leaves a lot of uncertainty about how the cache ports relate to the cache SRAM banks. Intel's comments that a 16-Byte read may access as many as three banks strongly imply that the banks are 8 Bytes wide. Another possibility is that the addresses are "swizzled" somehow, but I have been unable to come up with a swizzling scheme that matches Intel's descriptions or my observations. Still a possibility, of course. I did not work on the core microarchitecture when I was at AMD, but my impression was that aligned 16-Byte loads in the Family 10h processors were serviced by a single bank. We don't have any Haswell systems at TACC -- I think that it will be the second half of next year before the two-socket Haswell-based servers are available. We will probably have access a bit earlier. I also read Intel's comments about the absence of bank conflicts in Haswell, and am looking forward to testing the new technology.
SB's L1D banks
Author: Tacit Murky | Date: 2013-11-03 03:29
Hello, John. In our (ixbt.com) low-level tests we have confirmed that the L1D has 8-byte banks selected by address bits 5:3 (this was also confirmed by an SB architecture team engineer). Solving the 4-byte access case is easy: OoO memory access (Intel term: MD) will reorder reads to issue them to different banks — A+0 & A+8, then A+4 & A+12, then the same for the next 4 reads and 2 banks, etc. (A = line's address). Also, by delaying the 1st access (producing a «conflict» event for the PMC), it's possible to issue all the other loads without reordering, still hitting different banks: A+0 & (none), A+4 & A+8, A+12 & A+16…

DDR for cache bit-lines is possible but removes the possibility of (practically — the need for) precharge. Without precharge, bit-lines will have to swing 0<=>1 and back up to twice per clock. That requires fast (HT) transistors with high parameter uniformity (a big problem at 45 nm and below) and, most importantly, will ruin the performance/watt metric for such a cache. And both Intel & AMD are avoiding this at all costs — like converting 6T bit-cells to 8T (for L1s and L2s) just to save power. But I'm still curious how Intel resolved bank conflicts in Haswell. The naive solution is to make all banks 3-ported (2R+W), which would require 10T cells. But early die-shots show only a slightly larger L1D area compared to IB, with the same aspect ratio. Hm?…

While we're at it, can I ask why AMD's memory controllers are so slow, especially on writes? They never achieve even 50% of theoretical peak throughput. Intel can do more. See AIDA64 «cache & memory benchmark» results, like this: www.easycom.com.ua/data/nouts/1302101905/img/38_aida64_memory-cache.jpg
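The reordering trick described here is easy to verify numerically. The bank function (address bits 5:3 over eight 8-byte banks) and the issue order follow the description above; the script itself is just an illustration.

```python
# Check the read-reordering trick: with eight 8-byte banks selected by
# address bits 5:3, a stream of 4-byte loads A+0, A+4, A+8, ... can be
# paired so that the two loads issued each cycle hit different banks.

def bank(addr):
    return (addr >> 3) & 7      # 8 banks, each 8 bytes wide

A = 0                            # cache line base address
issue_order = [A + o for o in (0, 8, 4, 12, 16, 24, 20, 28)]
pairs = list(zip(issue_order[::2], issue_order[1::2]))

print([(bank(a), bank(b)) for a, b in pairs])
print(all(bank(a) != bank(b) for a, b in pairs))   # True: no pair conflicts
```

Issued in program order (A+0 & A+4, A+8 & A+12, ...) every pair would land in the same bank; the reordered pairing avoids that without touching more than two banks per cycle.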
SB's L1D banks
Author: | Date: 2013-11-07 16:40
Thanks to Tacit Murky for the comments. I like the reordering trick, but it only works if you have accesses to different banks that can be re-ordered. In my original analysis I did not make this assumption. Consider, for example, performing a dot product on vectors of 32-bit values, each with a stride of 64B, and with a modulo-64 offset of 4 Bytes. Every load will access the same bank, so I think this case will have lots of conflicts, but every pair of loads differs in bit 2, so the pairs do not match in bits 5:2 and therefore (according to the wording of the optimization reference manual section 3.6.1.3, page 3-43) should *not* experience bank conflicts. I had intended to test this particular case, but now that I look at my code I see that my code with offsets does roll over all of the banks (using a stride of 68 Bytes), so the reordering trick is sufficient to explain the observed drop in bank conflicts. Concerning the write bandwidth on the AMD processors: Recent Intel processors (Nehalem & newer) have 10 "Line Fill Buffers" per core, and use these for streaming stores. In contrast, the AMD Family 10h processors have 8 "Miss Address Buffers" that are used for cacheable L1 misses (load or store) and 4 separate "Write Combining Buffers" that are used for streaming stores. This gives the AMD Family 10h processor significantly less potential concurrency for stores. Unfortunately it is quite difficult to estimate the amount of time that a buffer needs to be occupied for a streaming store operation, so it is not obvious how to determine whether the streaming store performance is concurrency-limited. 
In both AMD and Intel systems, the buffers used by the cores to handle streaming stores will hand off the data to the memory controller at some point, so they will probably have shorter occupancy than what is required for reads (since the buffers have to track reads for the full round trip), but the specifics of the hand-off are going to be implementation dependent and I don't see any obvious methodology for estimating occupancy. Once the streaming stores have been handed off to memory controller queues things are even less clear, since the number of buffers in the memory controller does not appear to be documented, and the occupancy in those buffers will depend on details of the coherence protocol that are unlikely to be discussed in public. A brief look at the BIOS and Kernel Developer's Guide for the AMD Family 15h processors suggests that the cache miss buffer architecture has been changed significantly, but I have not worked through the details. I did find a note in AMD's Software Optimization Guide for Family 15h Processors (publication 47414, revision 3.06, January 2012) that says that Family 15h processors have about the same speed as Family 10h processors when writing a single write-combining stream, but may be slower when writing more than one write-combining stream. I have a few Family 15h boxes in my benchmarking cluster, but since our production systems are all currently Intel-based, I have not had much motivation to research the confusing bandwidth numbers that I obtained in my initial testing. |
Test results for Intel's Sandy Bridge processor
Author: | Date: 2015-08-18 09:45
Hi Agner, When I was doing some very fine-grained performance testing on Haswell (Xeon E5-2667 v3), I saw some anomalies that reminded me of your comments on the AVX "warm-up" period on Sandy Bridge. The test code is an L1-contained summation of a single vector. For N=2048 and 256-bit VADDPD instructions, it should take 512 cycles (plus some overhead). What I observed was (1) an initial "emulation" period of 4-7 iterations that took ~2200 cycles each, (2) a "transition" iteration that took over 31,000 cycles -- about 25,500 halted, and about 5500 active, (3) "normal" behavior of 512 or 516 cycles for the rest of the iterations (after subtracting the approximate overhead).

I added an outer loop with a (non-256-bit) "spinner" to see how long it takes for the processor to revert to the initial behavior. If the spinner between outer loop iterations was less than 1 millisecond, the subsequent inner iterations ran at full speed. If the spinner between outer loop iterations was more than 1 millisecond, the subsequent inner iterations showed the behavior above. This behavior occurs even if the core frequency is bound to any of the available frequencies (except perhaps the lowest frequency -- I need to go back and double-check those results). Performance counters showed that the core was running at the requested frequency in each case (comparing actual and reference cycles gave the expected ratio).

So this looks like a very low-level emulation of the 256-bit pipeline by forcing everything through the bottom 128-bit pipe, with a remarkably slow transition when the upper 128-bit pipe is enabled. Perhaps the current draw is so large that the chip has to wait for the voltages to settle, even with no frequency change? I did not look for evidence of the overhead of the transition in the other direction -- I assume it will be much quicker to turn off the upper 128-bit FP pipe than to turn it on.
Test results for Intel's Sandy Bridge processor
Author: Agner | Date: 2015-08-18 10:52
John D. McCalpin wrote:
So this looks like a very low-level emulation of the 256-bit pipeline by forcing everything through the bottom 128-bit pipe, with a remarkably slow transition when the upper 128-bit pipe is enabled.

Thank you for sharing your findings. I wonder if it is possible to distinguish between running at reduced speed and running in the lower 128 bit lane.
Test results for Intel's Sandy Bridge processor
Author: | Date: 2015-08-24 11:22
My test code measured the elapsed TSC time (using RDTSCP) and used the performance counters to measure Unhalted Core Cycles and Unhalted Reference Cycles (with inline RDPMC instructions). The difference between TSC cycles and Unhalted Reference Cycles gives the number of halted cycles, and the ratio of Unhalted Core Cycles to Unhalted Reference Cycles gives the average core frequency while not halted (relative to the nominal frequency). This data is sufficient to clearly distinguish low-frequency operation from operation in a degraded performance mode. It might not be enough to identify operation with T-state throttling -- I have never tried to use that feature.
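The counter arithmetic being described is simple enough to sketch in a few lines. The helper function and the sample numbers below are mine, for illustration only; they are not measurements from this thread.

```python
# Halted cycles are TSC minus unhalted reference cycles; the ratio of
# unhalted core cycles to unhalted reference cycles gives the average
# core frequency (relative to nominal) while the core was not halted.

def summarize(tsc, core_cyc, ref_cyc, nominal_ghz):
    halted = tsc - ref_cyc
    avg_ghz = nominal_ghz * core_cyc / ref_cyc
    return halted, avg_ghz

# Hypothetical sample: a 3.2 GHz nominal part running in turbo,
# with 1000 cycles spent halted during the measurement window.
halted, avg = summarize(tsc=51_000, core_cyc=56_250, ref_cyc=50_000,
                        nominal_ghz=3.2)
print(halted)          # 1000 halted cycles
print(round(avg, 2))   # 3.6 GHz average while unhalted
```

With these three counters one can tell a core that is halting (high `halted`) from one running slowly (low `avg`), which is exactly the distinction needed here.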
Test results for Intel's Sandy Bridge processor
Author: Agner | Date: 2015-08-25 00:28
John D. McCalpin wrote:
This data is sufficient to clearly distinguish low-frequency operation from operation in a degraded performance mode.

I think it runs at reduced frequency or with idle clocks in between. If it was running 256-bit instructions through the lower 128-bit unit, you would probably see half speed, not quarter speed, and double the number of retired µops.
Test results for Intel's Sandy Bridge processor
Author: | Date: 2015-08-25 11:58
If it were only a matter of arithmetic I would also expect the code to run at 1/2 speed when using only the lower 128-bit pipe. However, on Haswell the transfer of data from the upper 128 bits of the AVX registers to the lower 128 bits has a 3-cycle latency, and although this can be fully pipelined in software (at 1 op/cycle), it is easy to believe that a hardware emulation mode that is only intended to run for a minute fraction of the total cycles might not fully pipeline the "cross-lane" transfer across multiple instructions. The uop count is a matter of how the engineers chose to implement the feature. If the implementation is internal to the functional unit, then it would not require extra uops, and I do not see any significant change in uop counts between the "slow" and "normal" phases. (The uop counts are elevated for the iteration that includes the transition, but it is not at all clear what is happening in that step.) I think I forgot to mention that, just as you noticed on Sandy Bridge, there are no "warm-up" effects when using scalar AVX operations or 128-bit SSE operations. (I did not check 128-bit AVX, but there is not a lot of reason for that to be different from 128-bit SSE.) My assumption that the data is running through the "lower" 128-bit pipe is based in part on the observation that the 128-bit pipeline is available at full speed at all times. From an implementation perspective, running the 256-bit operations at a lower frequency does not make a lot of sense when there is a full-speed 128-bit pipeline ready to use.
Haswell upper128 power gating
Author: | Date: 2015-08-28 22:58
John D. McCalpin wrote:
What I observed was ...

The huge number of halted cycles in (2) makes it likely that this isn't just a timer interrupt hitting one of your iterations or something. Probably not migrating from one core to another, either. I agree with your speculation that this is probably the core halting for voltages to settle after powering up the high half of the execution units. I would have thought it might be possible to keep doing (1) "emulation" while the upper half settled, but this shows that Haswell doesn't work that way.

I wouldn't be so quick to assume that powering down the upper 128 doesn't halt for a lot of cycles, too. There will be some capacitance, so the supply voltage won't go to zero instantly. Garbage signals coming out of the upper128 vector units as the charge dissipates could well be a problem. Clearly there isn't gating to protect the rest of the execution unit from this, or emulation mode could continue while the upper128 powered up, and you'd go from (1) to (3) without a slow transition iteration (not *that* slow, anyway).

If we're lucky, powering down the upper128 of the vector units won't slow down integer code that uses different execution units, even though the integer execution units are on the same ports as the vector execution units. So it would be useful to alternate xmm and ymm vector loops, and ymm with non-vector loops, to look for a difference in the number of halted cycles when the CPU decides to power down the upper128.

Maybe the CPU's internal power management won't power down the upper128 unless the core is halted for another reason? Your 1ms of spin-loop threshold seems to rule that out, though.

I assume the whole core halts, affecting both hyperthreads, because it's due to a physical process. I guess you could look for this effect by timing a loop repeatedly on the other hardware thread, and recording timestamps for anomalies.
If the timestamp for an extra-slow iteration in one thread was close to the timestamp for the transition iteration in the 256b-vector loop, then you could conclude that the whole core halted.
I'm not surprised that this is unrelated to frequency. Power-gating the upper128 of the vector units is a win at any frequency. Saving power at max frequency allows you to stay at max turbo longer. (Not to mention battery life.)

I think one compelling reason for doing it at a low level inside the execution units, rather than with special uops, is that Intel CPUs that support AVX also have a uop cache. You don't want to have to mark lines in the uop cache as "decoded for 128b-emulation" vs. "decoded for 256b vector units", and then potentially re-decode after powering up / down the upper128. OTOH, extra uops could be generated on the fly in the scheduler that follows the ROB (re-order buffer), when uops are converted from fused-domain to unfused domain. If that's how it works, these uops could be flagged as "internally generated" so the perf counters don't count them. They may need to be flagged this way anyway, for things to work correctly. I doubt Intel would add extra complexity just for perf-counter bookkeeping to hide the internals.

You did look at all the different uop issue / execute / retire counters, some of which count in the fused domain, and some of which count unfused uops, right? You said you looked at uops dispatched to ports, so I guess that should cover the unfused domain.

As you point out, it's a bit surprising that perf is worse than half. Pentium M had 64b execution units, and took longer for 128b vector ops, but only about twice as long. In that case, though, 128b vector ops decoded to 2 uops, instead of having shuffling within the execution unit. Maybe this emulation mode isn't fully pipelined, or the unusual latency creates write-back conflicts?

If emulation mode was fairly efficient, the upper128 might never need to power on for 256b code that was limited by memory bandwidth, frontend (ROB not filling up), or insn latency rather than throughput. Even 1/4 perf might still be efficient enough for some cases.
Maybe making emulation mode faster would have taken more transistors, and they decided it wasn't worth spending them just to speed up the slow mode so they could be less aggressive about powering up the upper128. |
Haswell upper128 power gating |
---|
Author: Agner | Date: 2016-01-16 03:23 |
Session number SPCS001 at Intel Developer Forum 2015 says that Skylake can power down the upper 128-bit half of the 256-bit execution engine when it is not used: myeventagenda.com/sessions/0B9F4191-1C29-408A-8B61-65D7520025A8/7/5 This is presented as an innovation in Skylake. What John McCalpin has observed in previous processors is perhaps a different power-saving mechanism? |
Haswell upper128 power gating |
---|
Author: | Date: 2016-01-29 13:51 |
The IDF Skylake presentation seems to be saying something quite different from powering down the upper 128-bit lanes. The slide says the AVX2 infrastructure is powered down when not in use -- it says nothing about lanes or about 128 bits -- and the presenter was pretty clear, saying that the whole AVX2 "area" was powered off. This does lead to some problems of interpretation, since it is not clear whether this means only the AVX2 extensions (and not AVX v1, which is also 256 bits wide), or whether the processor keeps (at least) one 64-bit FP pipeline powered up. One can imagine that the number of applications that use no AVX2 instructions is quite large, and the number of applications that use no 256-bit registers is quite large, but the number of applications that use no floating-point at all is not nearly as large. Of course no hint is provided about the cost of the transition.

So it looks like Sandy Bridge, Haswell, and Skylake (client) all turn off the upper 128 bits of the SIMD pipelines, but only Haswell pays the ~10 microsecond stall when the upper lanes are turned on. It may not be a coincidence that of these processors, only Haswell uses in-package voltage regulators. One might speculate that the smaller in-package voltage regulators are unable to hold the voltage steady under large load increases, so powering up the upper 128-bit SIMD lanes requires a stall while the voltage recovers. The 10 microsecond stall is similar in magnitude to the stalls on p-state changes on earlier processors. I don't think that I have seen good measurements of the overhead of p-state changes in Xeon E5 v3 processors.

Of course we know nothing about the nature of the power-saving modes in either Sandy Bridge or Haswell. For example, one might speculate that turning off the clocks to the upper 128-bit SIMD lanes (but leaving the power on) would produce less power saving, but also less voltage drop when the clocks are re-enabled. There are still some anomalies.
Re-reading Agner's comments leaves me with the impression that he has not seen the ~10 microsecond stall on any processors tested. Is this correct? If so, which Haswell models were tested? I imagine a 10 microsecond stall could be very upsetting to some people working in real-time signal processing, so it would be nice to know which processors show this behavior and which do not. One would guess that Skylake will also experience a large stall when it needs to enable the AVX2 "area", but it is not clear how Intel is managing this transition. Looking forward, one would imagine that the power implications of enabling/disabling the 512-bit SIMD units in AVX-512 could lead to even larger disruptions? |
Haswell upper128 power gating |
---|
Author: Agner | Date: 2016-01-30 01:23 |
John D. McCalpin wrote: "the presenter was pretty clear, saying that the whole AVX2 'area' was powered off."

I don't think there is a special "area" just for AVX2. The execution units are divided by functionality, not by instruction set. The commercial presentation was just simplifying things.

John D. McCalpin also wrote: "Re-reading Agner's comments leaves me with the impression that he has not seen the ~10 microsecond stall on any processors tested. Is this correct? If so, which Haswell models were tested?"

It was not a stall, but a 14 µs period of reduced throughput for all 256-bit instructions, both AVX and AVX2. I have seen this only on Skylake. The processors tested are listed in my instruction tables (Haswell family 6 model 3C). |
Test results for Intel's Sandy Bridge processor |
---|
Author: Agner | Date: 2015-12-20 05:56 |
John D. McCalpin wrote: "Hi Agner,"

I am testing the Skylake processor now, and it has a warm-up period for 256-bit vector operations of approximately 65,000 clock cycles. During this period, all 256-bit instructions take ~4.5 times as many clock cycles as normal. After the warm-up period, the 256-bit instructions have the same latency and throughput as similar 128-bit instructions. I am not seeing the same phenomenon on any of the previous processors. Maybe you have a later version of Haswell. What is the CPUID and stepping number? |
Test results for Intel's Sandy Bridge processor |
---|
Author: | Date: 2015-12-21 16:38 |
Most of my testing was on various Xeon E5-2660 v3 processors. The Xeon E5 v3 specification update says that these are CPUID 0x306F2, Stepping M1. I have seen the same behavior as recently as this morning on a system with Xeon E5-2680 v3 processors (same CPUID and stepping). Note that only the Xeon E5 v3 parts have the differentiated frequencies for 256-bit operation, so I would not be surprised if the "client" Haswell parts did not show this behavior. I repeated these tests on a Sandy Bridge (Xeon E5-2680) and found the same "half-speed" operation that you reported earlier. In my experiments the "half-speed" operation lasted for up to a few thousand cycles, but the transition to full speed operation incurred no stall cycles. It also appears that the "full speed" mode of operation is not retained as long -- even short (much less than 1 millisecond) periods of not using the 256-bit registers resulted in switching back to the slower mode. |
Test results for Intel's Sandy Bridge processor |
---|
Author: Agner | Date: 2015-12-22 01:42 |
This is interesting. I can see this warm-up behavior on an Intel Skylake i7-6700. It seems to use the lower 128-bit lane twice for 256-bit instructions during a warm-up period of 14 µs before the 256-bit instructions can run at full speed. It goes back to the cold state after 675 µs of no 256-bit instructions. I have never seen this behavior on any other processor. It would be interesting to know which processors have this behavior and which ones have not. |
Test results for Intel's Sandy Bridge processor |
---|
Author: | Date: 2015-12-24 04:04 |
Hi all, I have three Haswell machines, so I decided to test this phenomenon on all of them. I'm not doing exact cycle measurements using MSRs; I used my old program that I created around two years ago to test an FMA implementation of Bresenham's line algorithm. I calculate the pixels of a rasterized line and measure the calculation duration using RDTSC, but the results still tell something about this warm-up effect. I calculate the same line in 100 consecutive runs; under ideal conditions, a single iteration should take around 600 cycles. Now the results:

Machine 1: this is my personal desktop PC with a Core i7-4770K; I bought it right when the Haswell CPUs were released, in June 2013. I don't remember the exact results of my experiments when I created this program two years ago, but I think that this "third iteration slowdown" wasn't happening back then: only a few longer iterations at the start, and then the rest of the iterations were fast.

Machine 3: this is a new, powerful workstation that we bought at work, with a Core i7-5820K. Here the results are quite different.

So, my uneducated guess is that this behavior might also be caused by different versions of the microcode. I might try downgrading the BIOS of my desktop and retry these tests with older microcode. |
Test results for Intel's Sandy Bridge processor |
---|
Author: | Date: 2015-12-25 15:10 |
Why microseconds? It would be more precise to repeat the measurement at different clock frequencies, since the timeout is most likely counted internally in cycles. Something like an internal counter to shut down or power up the unofficial ports (which raises the question of how many of them there actually are). The switching time is most likely limited by the time necessary to flush the internal compiled code (as confirmed by some polymorphic code tests) and probably the interpreter (decoder); should they decrease the timeout, there might be a decent performance drop. |
Test results for Intel's Sandy Bridge processor |
---|
Author: Agner | Date: 2015-12-26 01:01 |
Just_Coder wrote: "Switching time is most likely limited by the time necessary to flush the internal compiled code."

That would be 56,000 clock cycles at 4 GHz. This clock count is too high to be explained by a pipeline flush. More likely it is the time needed to power up the circuits and charge some internal capacitors. |
Test results for Intel's Sandy Bridge processor |
---|
Author: | Date: 2015-08-23 00:35 |
From recent testing I have some uncertainties: do you think partial decoding (instruction length and such) takes place at the stage of filling the cache (L1i)? |
Test results for Intel's Sandy Bridge processor |
---|
Author: Agner | Date: 2015-08-25 00:12 |
Just_Coder wrote: "do you think partial decoding (instruction length and such) takes place at the stage of filling the cache (L1i)?"

Instruction boundaries are marked in the instruction cache on AMD processors, but not on most Intel processors. |
Reply To This Message |