Software optimization resources | www.agner.org
Test results for Knights Landing |
---|
Author: Agner | Date: 2016-11-26 08:39 |
The Knights Landing is Intel's new "Many Integrated Core" processor. It has 64-72 cores that can run four threads each. It is built with a 14 nm process and runs at a clock frequency of 1.3-1.5 GHz. It is intended for processing large data sets in parallel. It is only useful, of course, if the calculations can easily be split up into multiple threads that can run in parallel.
|
Test results for Knights Landing |
---|
Author: Nathan Kurz | Date: 2016-11-26 19:07 |
Thanks for publishing this! Some comments and questions:
> The Knights Landing has full out-of-order capabilities
It may be worth mentioning that the Memory Execution Cluster is still in order. The Intel Architectures Optimization Reference Manual says "The MEC has limited capability in executing uops out-of-order. Specifically, memory uops are dispatched from the scheduler in-order, but can complete in any order." More specifically, they mean "in order with regard to other memory operations": unlike other modern Xeons, a memory operation with an unfulfilled dependency will block successive memory operations from being dispatched even if their dependencies are ready.
> The reservation stations have 2x12 entries for the integer lines, 2x20 entries
The manual says "The single MEC reservation station has 12 entries, and dispatches up to 2 uops per cycle." I think this means that both 'memory lines' share a single 12-slot queue that is used for both scalar and vector operations? If so, it might be good to be more explicit.
> The Knights Landing has two decoders. The maximum throughput is two
The manual says "The front end can fetch 16 bytes of instructions per cycle. The decoders can decode up to two instructions of not more than 24 bytes in a cycle.", and then later says "The total length of the instruction bytes that can be decoded each cycle is at most 16 bytes per cycle with instructions not more than 8 bytes in length. For instruction length exceeding 8 bytes, only one instruction per cycle is decoded on decoder 0." I haven't figured out when the 24B limit would apply. Have you? Also, is it correct that these limits are not affected by alignment?
> The throughput is limited to two instructions per clock cycle
By emphasizing that it's a 'single µop', do you mean that the same µop is first sent to the memory unit and then, when the data is ready, (Figure 16-2) sent to the Integer or FP rename buffer as appropriate? And thus, since the renamer can handle only two instructions per cycle, does this mean that, unlike on other Xeons, the use of 'read-modify' (aka 'load-op') instructions does not usually help to increase µop throughput?
> There is no penalty for the 2- and 3-byte VEX prefixes and 4-byte EVEX prefixes
It's fairly easy to check that all the instructions are encoded to less than 8B, but I'm not sure I know how to correctly count the prefixes. Do you have any examples of common cases where this would be a problem?
> The Knights Landing has no loop buffer, unlike the Silvermont. This means
The corollary to this is that loop unrolling can be a win on KNL even when it would be counterproductive on a Xeon with a loop buffer.
> These are forked after the register allocate and renaming stages into two the integer unit
I think there is a missing word after 'two'?
> A 64-bit register can be cleared by xor'ing the corresponding 32-bit register with itself.
Hmm, interesting find. Do you see any reason they would choose to recognize the 32-bit idiom but not the 64-bit? Or just oversight?
> The processor can do two memory reads per clock cycle or one read and
I can't tell from this (or the Intel manual) which "vector plus scalar" memory operations are possible in the same cycle. Do you know if it can read a vector and a scalar in the same cycle? Read one and write the other?
> The latency from the mask register, k1, to the destination is 2 clock cycles.
I'm not understanding what you are measuring here. Are you saying that if you have an instruction that changes the mask or the input, immediately followed by a masked read, you will have an effective load latency of 2 + 5? Or something else?
> The {z} option makes the elements in the destination zero, rather than
Does this dependency occur only when explicitly using a mask, or does it matter in other cases too? That is, does {z} make a difference without a mask?
> The best solution in such cases may be to turn off hyper-threading in the BIOS setup.
Do you think there is a difference between turning off hyper-threading in BIOS, rather than telling the OS to disable some of the cores at runtime? The manual says "Hard partitioning of resources changes as logical processors wake up and go to sleep", which makes me think that OS level would be fine, although I'm not sure exactly what it means by 'sleep'.
> 15.11 Bottlenecks in Knights Landing
As shown in your Instruction Table, the extreme cost of some of the "byte oriented" vector instructions might be worth calling out. The ones that stood out for me were VPSHUFB y,y,y with latency 23 and reciprocal throughput 12 (versus 1 and 1 for Skylake) and PMOVMSKB r32,y with latency 26 and reciprocal throughput 12 (versus 2-3 and 1 for Skylake). Specific mention of some of these might be helpful, since while these are still technically supported, it's unlikely that you'd want to use an algorithm that depends on them.
There are a couple of passages in the manual that I haven't been able to decipher. Do you perhaps understand this one, which (among other things) pertains to shuffles and permutes? "Some execution units in the VPU may incur scheduling delay if a sequence of dependent uop flow needs to use these execution units, these outlier units are indicated by the footnote of Table 16-2. When this happens, it will have an additional cost of a 2-cycle bubble."
The other line that scares me in the manual is this one: "The decoder will also have a small delay if a taken branch is encountered." Did you happen to figure out how long this delay is?
Thanks again for making this great research available! I'm sure a lot of people will greatly appreciate it. |
Test results for Knights Landing |
---|
Author: Tom Forsyth | Date: 2016-11-27 00:45 |
Looks like a typo in the KNL recip-throughput number for FMA - it's currently 3. KNF and KNC get 1, and this chip is a real FMA machine - it's designed around that unit. Pretty sure the correct number for KNL is 0.5 (like VADDPS and VMULPS). As for why the chip has 4 threads per core - I'm the guy that persuaded KNF (and thus KNC) to have 4 threads, and the reason is first to hide memory misses, second branch mispredicts, and third instruction latencies. Those are all huge bottlenecks in real-world performance. Yes, you can also hide them with huge OOO machines, wide decoders, and long pipelines, but when flops/watt is your efficiency metric, those aren't the first choice. The Knights line of chips already open the book with "we assume you have 70+ threads of stuff to do", and while getting from 1 thread to 4 is agony, and 4 threads to 16 is hard, getting from 70 threads to 280 is actually pretty simple. |
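To put the FMA number in perspective, here is a rough back-of-the-envelope sketch of what a reciprocal throughput of 0.5 versus 3 would mean for theoretical peak arithmetic. The function name is made up for illustration, the 68-core, 1.4 GHz configuration is just one example within the 64-72 core, 1.3-1.5 GHz range from the first post, and the lane counts assume 512-bit vectors (16 single-precision or 8 double-precision elements). It is an upper bound, not a measurement.

```python
def peak_gflops(cores, clock_ghz, fma_recip_throughput, lanes):
    """Theoretical peak in GFLOP/s, counting each FMA as 2 flops per lane.

    All arguments are illustrative assumptions, not measured values.
    """
    fma_per_clock = 1.0 / fma_recip_throughput
    return cores * clock_ghz * fma_per_clock * lanes * 2

# Example configuration (assumed): 68 cores at 1.4 GHz, 512-bit vectors.
single_fast = peak_gflops(68, 1.4, 0.5, lanes=16)  # two FMAs per clock
double_fast = peak_gflops(68, 1.4, 0.5, lanes=8)
single_slow = peak_gflops(68, 1.4, 3.0, lanes=16)  # one FMA every 3 clocks

print(f"single precision, recip. throughput 0.5: {single_fast:.1f} GFLOP/s")
print(f"double precision, recip. throughput 0.5: {double_fast:.1f} GFLOP/s")
print(f"single precision, recip. throughput 3:   {single_slow:.1f} GFLOP/s")
```

The factor of six between the two single-precision figures is why a value of 3 in the table looked like a typo for a chip built around its FMA units.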
Test results for Knights Landing |
---|
Author: Søren Egmose | Date: 2016-11-27 03:13 |
Considering that OpenPower already supports 8 hardware threads per core this is apparently an acceptable way to go. This also opens a new approach to multithreading where you can have the threads on a single core cooperate to solve the problems as a small pack. |
Test results for Knights Landing |
---|
Author: Agner | Date: 2016-11-30 04:33 |
Tom Forsyth wrote:
> Looks like a typo in the KNL recip-throughput number for FMA - it's currently 3. KNF and KNC get 1, and this chip is a real FMA machine - it's designed around that unit. Pretty sure the correct number for KNL is 0.5 (like VADDPS and VMULPS).
You are right. My mistake. Fused multiply-and-add has a throughput of two instructions per clock.
> As for why the chip has 4 threads per core - I'm the guy that persuaded KNF (and thus KNC) to have 4 threads, and the reason is first to hide memory misses, second branch mispredicts, and third instruction latencies. Those are all huge bottlenecks in real-world performance. Yes, you can also hide them with huge OOO machines, wide decoders, and long pipelines, but when flops/watt is your efficiency metric, those aren't the first choice.
Thanks for clarifying the reason. Running 4 threads in an in-order core makes sense. Do you think that 4 threads is still useful in the out-of-order KNL?
Nathan Kurz wrote:
>> The Knights Landing has full out-of-order capabilities
Memory operations are scheduled in order but executed out of order, as I understand it.
> The manual says "The front end can fetch 16 bytes of instructions per cycle. The decoders can decode up to two instructions of not more than 24 bytes in a cycle.", and then later says "The total length of the instruction bytes that can be decoded each cycle is at most 16 bytes per cycle with instructions not more than 8 bytes in length. For instruction length exceeding 8 bytes, only one instruction per cycle is decoded on decoder 0." I haven't figured out when the 24B limit would apply. Have you? Also, is it correct that these limits are not affected by alignment?
I just tried. A block of two instructions of 24 bytes total can decode in a single clock cycle, but you cannot have consecutive blocks exceeding 16 bytes, and the average cannot exceed 16 bytes per clock. And yes, alignment matters. Decoding is most efficient when aligned by 16. There is probably a double buffer of 2x16 bytes.
>> Read-modify, and read-modify-write instructions generate a single μop from the decoders, which is sent to both the memory unit and the execution unit.
Yes, that's how I understand it.
>> There is no penalty for the 2- and 3-byte VEX prefixes and 4-byte EVEX prefixes unless there are additional prefixes before these.
You do not normally need any additional prefixes in front of VEX and EVEX prefixes. The only case is FS and GS segment prefixes for the thread environment blocks, and these blocks are usually accessed with integer instructions. But instructions without VEX and EVEX can be a mess of prefixes. Legacy SSSE3 instructions without VEX prefix all have a meaningless 66H prefix and a 2-byte escape code (0FH, 38H). An additional REX prefix is needed if register r8-r15 or xmm8-xmm15 is used. This gives a total of 4 prefix and escape bytes, which is more than the decoder can handle in a single clock cycle. Another example is the ADCX and ADOX instructions with 64-bit registers.
>> A 64-bit register can be cleared by xor'ing the corresponding 32-bit register with itself.
An optimizing compiler would use xor eax,eax rather than xor rax,rax because the former is shorter. But it may be an oversight. There is no difference in length between xor r8,r8 and xor r8d,r8d.
> I can't tell from this (or the Intel manual) which "vector plus scalar" memory operations are possible in the same cycle. Do you know if it can read a vector and a scalar in the same cycle?
Yes, it can do a vector read and a general-purpose register read in the same clock cycle. It can also do any combination of a read and a write.
>> The latency from the mask register, k1, to the destination is 2 clock cycles. The latency from zmm1 input to zmm1 output is 2 clock cycles.
> I'm not understanding what you are measuring here. Are you saying that if you have an instruction that changes the mask or the input, immediately followed by a masked read, you will have an effective load latency of 2 + 5? Or something else?
The latencies come in parallel, not in series, so the latency from a mask or destination register will be 2 if the memory operand is ready.
>> The {z} option makes the elements in the destination zero, rather than unchanged, when the corresponding mask bit is zero. It is recommended to use the zeroing option when possible to avoid the dependence on the previous value.
> Does this dependency occur only when explicitly using a mask, or does it matter in other cases too? That is, does {z} make a difference without a mask?
A masked instruction has a dependency on the destination register if there is a mask and no {z}. There is no dependency on the destination register if there is no mask or if there is a mask with a {z}. It doesn't make sense to put {z} when there is no mask.
>> The best solution in such cases may be to turn off hyper-threading in the BIOS setup.
> Do you think there is a difference between turning off hyper-threading in BIOS, rather than telling the OS to disable some of the cores at runtime?
There is no difference. The problem is that current operating systems are not handling hyper-threading optimally. In fact, the O.S. lacks the necessary information. The CPUID instruction can tell how many threads are sharing the same L1 or L2 cache, but it does not tell which threads are sharing a cache, and it does not tell if threads are sharing decoders, execution units, and other resources. The proper solution would be to make the CPUID instruction tell which resources are shared between which threads, and make the operating system give high-priority threads an unshared core of their own. But still, the O.S. does not know what the bottleneck is in each thread. It may be OK to put two threads in the same core if the bottleneck is memory access, but not if the bottleneck is instruction decoding. The application programmer or the end user cannot control this either if programs are running in a multiuser system. The hardware designers have given an impossible task to the software developers.
> Do you perhaps understand this one, which (among other things) pertains to shuffles and permutes? "Some execution units in the VPU may incur scheduling delay if a sequence of dependent uop flow needs to use these execution units, these outlier units are indicated by the footnote of Table 16-2. When this happens, it will have an additional cost of a 2-cycle bubble"
There is an extra latency of 1-2 clock cycles when, for example, the output of an addition instruction goes to the input of a shuffle instruction. The data have to travel a longer distance to these units that they call outliers. There is less latency when the output of a shuffle instruction goes to the input of another shuffle instruction.
> The decoder will also have a small delay if a taken branch is encountered. Did you happen to figure out how long this delay is?
A taken branch has a delay of 2 clock cycles. |
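The point about {z} and the dependency on the destination register can be illustrated with a small element-by-element model of write-masking. This is only a sketch of the architectural semantics with a made-up helper name and four-element vectors for brevity; it says nothing about timing.

```python
def masked_op(dest, result, mask, zeroing=False):
    """Model AVX-512 write-masking for one instruction, element by element.

    dest    : previous contents of the destination register
    result  : what the unmasked operation would have produced
    mask    : integer whose bit i controls element i (like k1)
    zeroing : True models the {z} option, False models merge-masking
    """
    out = []
    for i, (old, new) in enumerate(zip(dest, result)):
        if (mask >> i) & 1:
            out.append(new)   # mask bit set: take the new value
        elif zeroing:
            out.append(0)     # {z}: masked-off element becomes zero
        else:
            out.append(old)   # merge: masked-off element keeps the old value
    return out

old_zmm1 = [10, 11, 12, 13]
computed = [20, 21, 22, 23]
k1 = 0b0101

merged = masked_op(old_zmm1, computed, k1)                # [20, 11, 22, 13]
zeroed = masked_op(old_zmm1, computed, k1, zeroing=True)  # [20, 0, 22, 0]
```

With merge-masking the output contains values from the old destination, so the instruction has to wait for whatever produced them. With {z} the old destination never appears in the output, which is why the dependency disappears.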
Test results for Knights Landing |
---|
Author: Joe Duarte | Date: 2016-12-03 23:57 |
Agner, what's the latency of the MCDRAM when used as main memory? |
Test results for Knights Landing |
---|
Author: Agner | Date: 2016-12-04 23:37 |
Joe Duarte wrote:
> what's the latency of the MCDRAM when used as main memory?
Approximately 200 clock cycles, I think. |
Test results for Knights Landing |
---|
Author: Constantinos Evangelinos | Date: 2016-12-05 16:59 |
Intel gives 150ns for MCDRAM and 125ns for main memory. |
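The figures in nanoseconds and in clock cycles can be compared with a simple conversion. The sketch below assumes that the cycle count refers to the core clock and uses the 1.3-1.5 GHz range from the first post; the function name is only for illustration.

```python
def ns_to_cycles(latency_ns, clock_ghz):
    """Convert a latency in nanoseconds to core clock cycles.

    Assumes the cycles are counted at the core clock (clock_ghz).
    """
    return latency_ns * clock_ghz

for ghz in (1.3, 1.4, 1.5):
    mcdram = ns_to_cycles(150, ghz)   # Intel's figure for MCDRAM
    ddr = ns_to_cycles(125, ghz)      # Intel's figure for main memory
    print(f"{ghz} GHz: MCDRAM ~{mcdram:.1f} cycles, main memory ~{ddr:.1f} cycles")
```

Under that assumption, 150 ns corresponds to roughly 195 to 225 core clock cycles, which is in the same range as the estimate of approximately 200.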
Test results for Knights Landing |
---|
Author: | Date: 2016-12-06 11:19 |
The Xeon Phi x200 (Knights Landing) has a lot of modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode. It is also important to remember that the latency can be different for each physical address, depending on the location of the requesting core, the location of the coherence agent responsible for that address, and the location of the memory controller for that address. Intel has not publicly disclosed the mapping of core numbers (APIC IDs) to physical locations on the chip or the locations of coherence agents (CHA boxes) on the chip, nor has it disclosed the hash functions used to map physical addresses to coherence agents and to map physical addresses to MCDRAM or DDR4 memory controllers. (In some modes of operation the memory mappings are trivial, but not in all modes.)
The Knights Landing system at TACC uses the Xeon Phi 7250 processor (68 cores, 1.4 GHz nominal). For operation in "Flat" mode (MCDRAM as memory, located in the upper 16 GiB of the physical address space), with the coherence agent mapping in "Quadrant" mode (addresses are hashed to coherence agents spread across the entire chip, but each cache line is assigned to an MCDRAM controller in the same "quadrant" as the CHA responsible for coherence), my preferred latency tester gives values of 154ns +/- 1ns (1 standard deviation) for MCDRAM. These values are averaged over many addresses, with the variation mostly from core to core (with a few ns of random variability). My latency tester uses permutations of even-numbered cache lines in various sized address range blocks, so it is not guaranteed that my averages are uniformly distributed over all the coherence agents.
For the same system in "Flat" "All-to-All" mode (addresses are hashed to coherence agents spread across the entire chip, with no special correlation between the location of coherence agents and the MCDRAM controller owning an address), the corresponding value is 156ns +/- 1ns (1 standard deviation). For the same system in "Flat" "Sub-NUMA Cluster 4" mode, the corresponding values are 150.5ns +/- 0.9ns (1 standard deviation) for "local" accesses, and 156.8ns +/- 3.1ns for "remote" accesses.
Variability across nodes is not entirely negligible, in part because different nodes have different patterns of disabled tiles. (Four of the 38 tiles are disabled on each Xeon Phi 7250 processor.) Run-to-run variability is typically small when using large pages, but there are certain idiosyncrasies that have yet to be explained.
Note that even though the average latency differences are quite small across these modes of operation, the sustained bandwidth differences are much larger. The decreased number of "hops" required for coherence transactions in "Quadrant" and "SNC-4" modes reduces contention on the mesh links and thereby allows higher sustained bandwidths. The difference between sustained bandwidth in Flat-All-to-All and Flat-Quadrant modes suggests that contention on the non-data mesh links (address, acknowledge, and invalidate) is more important than contention on the data transfer links (which should be the same for those two modes of operation). I will post more details to my blog as they become available.... |
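For readers unfamiliar with permutation-based latency testers, the general idea can be sketched as follows. This is not the tester described above: it is a generic pointer-chasing layout built with Sattolo's algorithm, with hypothetical function names, and Python is used only to show the index structure. A real tester would be written in C or assembly, would map each slot to a cache line (for example only even-numbered lines), and would time the dependent loads.

```python
import random

def single_cycle_permutation(n, seed=0):
    """Sattolo's algorithm: a random permutation of range(n) forming one cycle.

    nxt[i] is the slot visited after slot i, so a chase starting anywhere
    touches every slot exactly once before returning to the start.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)              # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, start=0):
    """Follow the chain once around and return the number of hops."""
    hops, pos = 0, start
    while True:
        pos = nxt[pos]                    # each access depends on the previous one
        hops += 1
        if pos == start:
            return hops

slots = single_cycle_permutation(1024)
hops = chase(slots)
print(hops)
```

Because every access depends on the result of the previous one and the order is random, neither out-of-order execution nor a stride prefetcher can overlap the accesses, so the time per hop approximates the load-to-use latency for the chosen block size.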
Test results for Knights Landing |
---|
Author: Agner | Date: 2016-12-06 12:20 |
My measurements of memory latencies are higher. However, I don't have access to control the memory configuration (it is not my machine). My measurements use random memory addresses to avoid prefetching. |
Test results for Knights Landing |
---|
Author: | Date: 2016-12-08 12:21 |
I have done testing with random permutations and with the hardware prefetchers disabled (and with both at the same time), and the simple stride results with no HW PF match the permuted results with HW PF enabled once the permutation block gets big enough. I did these tests back in July, and we have changed a number of aspects of the system configuration since then, but I think that Transparent Huge Pages were enabled when I did these tests. I don't recall if this was before or after we disabled some of the C states. The "untile" frequency may also make a difference -- it automatically ramps up to full speed when running bandwidth tests, but when running latency tests the Power Control Unit may not think that the "untile" is busy enough to justify ramping up the frequency.
Without knowledge of the tag directory hash, the processor placement, the MCDRAM hash, etc., it is challenging to make a lot of sense of the results. On KNC the RDTSC instruction had about a 5-cycle latency, so I was able to do a lot more with timing individual loads, and the single-ring topology made the analysis easier. There are more performance counters in the "untile" on KNL, but there is no documentation on where the various box numbers are located on the mesh. There is some evidence that the CHA box numbers are re-mapped -- on a 32-tile/64-core Xeon Phi 7210 all 38 CHAs are active, but the six CHAs with anomalous readings are numbered 32-37. The missing core APIC IDs are not bunched up in this way.
The stacked memory modules have slightly higher latency because they are typically run in "closed page" mode, and because there is an extra set of chip-to-chip crossings. HMC (and Intel's MCDRAM) have an extra SERDES step between the memory stack and the processor chip. There are many different approaches to error checking on SERDES, but it is probably safe to expect that error checking will require at least some added latency. |
Test results for Knights Landing |
---|
Author: | Date: 2016-12-07 15:44 |
Interesting. Why are we not getting lower latency from these integrated memory modules? They're closer to the processor than DIMM mounted DRAM, yet we never seem to reap any latency reductions. I'm thinking not just of MCDRAM, but also HBM2 and smartphone SOCs. |
Test results for Knights Landing |
---|
Author: zboson | Date: 2016-12-28 02:44 |
I'm confused by your statement: "AMD processors and later Intel processors (Skylake and Knights Landing) do not have such a state switch. They have no need for the VZEROUPPER" From this discussion it appears that VZEROUPPER is even more important in some situations with Skylake than e.g. with Haswell. |
VZEROUPPER |
---|
Author: Agner | Date: 2016-12-28 03:57 |
zboson wrote:
> it appears that VZEROUPPER is even more important in some situations with Skylake than e.g. with Haswell.
I have no access to a Skylake at the moment, but it may behave the same. This means that VZEROUPPER is still needed on all Intel processors, except Knights Landing. |
Test results for Knights Landing |
---|
Author: | Date: 2017-07-13 04:00 |
As always, many thanks for your work with the instruction timings and evaluations, they are truly helpful. I have one question regarding the permutations on KNL. As you can see in your instruction manual and corroborated in "Intel Xeon Phi Processor High Performance Programming", there is around 2X difference in throughput and 33% in latency when using a two source or one source permutation/shuffle with the one source variant being superior. I am interested in permutations on double precision operands to be more precise. Scanning through your instruction table, I see that you have added vpermpd as only working on ymm registers which presumably is a one source permutation. However, in the intrinsics guide, I can see that you can generate a vpermpd using zmm's with _mm512_mask_permutex_pd where you can use a mask to keep results from the source argument (which I think gets overwritten) and shuffles across 256bit lane in the other operand. This decays to vpermpd zmm {k}, zmm, imm. As far as I can tell, this is a one source operand permutation which will have a latency between 3-6 cycles and a rec throughput of 1, am I correct? From my understanding, the two source operands are the intrinsics that decay into vpermpd zmm {k}, zmm, zmm such as _mm512_maskz_permutexvar_pd or the VSHUFF64X2 instructions. Your clarification between these would be very helpful. Kind regards, |
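The difference between the two forms discussed above can be made concrete with a small model of their data movement. This is a sketch of the architectural behaviour as I understand it from Intel's documentation, with masking left out and function names borrowed loosely from the intrinsics; it does not model latency or throughput.

```python
def permutex_pd(src, imm8):
    """Model of VPERMPD zmm, zmm, imm8 (immediate form, mask omitted).

    src holds 8 doubles. The same four 2-bit selectors from imm8 are applied
    inside each 256-bit lane of 4 elements.
    """
    out = []
    for lane in range(0, 8, 4):
        for i in range(4):
            sel = (imm8 >> (2 * i)) & 3   # selector for element i of the lane
            out.append(src[lane + sel])
    return out

def permutexvar_pd(idx, src):
    """Model of VPERMPD zmm, zmm, zmm (index-vector form, mask omitted).

    Each output element picks any of the 8 source elements via idx.
    """
    return [src[i & 7] for i in idx]

a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

in_lane = permutex_pd(a, 0b00011011)                  # reverse within each lane
full = permutexvar_pd([7, 6, 5, 4, 3, 2, 1, 0], a)    # reverse across all 8
print(in_lane)
print(full)
```

In this model the immediate form reads only one vector register, while the index-vector form reads a second register holding the indices, which is the distinction between the one-source and two-source variants asked about above.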
Test results for Knights Landing |
---|
Author: Agner | Date: 2017-07-13 09:58 |
That's right. A 'v' in my instruction tables represents a vector register of any size. Intel's Software Developer’s Manual tells which intrinsics correspond to which instructions. |
INC/DEC throughput |
---|
Author: Peter Cordes | Date: 2017-10-09 23:52 |
Everything except your table says 1c throughput for INC/DEC, not 0.5c, for both KNL and Silvermont, and Intel's manual explains why. InstLatx64 for Silvermont says 1 per clock INC r32 throughput, but 0.5c ADD r32, imm8. Same for Goldmont (except that ADD r32,imm8 is 0.33c). I haven't found InstLatx64 results for actual KNL, but Intel's optimization manual describes KNL as having the same flag-merging extra uop for INC/DEC as Silvermont, so I expect that INC is worth avoiding on KNL as well when you're bottlenecked on issue throughput instead of decode. (Is there any IDQ between decode and issue to let the expansion from 1 uop decoded / 2 uops issued fill decode bubbles?) If so, it could still have an apparent cost of 0.5c in the right circumstances. According to Intel's optimization manual, 17.1.2 Out-of-Order Engine (KNL): Additionally, some instructions in the Knights Landing microarchitecture will be decoded as one uop by (lightly edited the list to group related instructions better.) This means there's *always* a flag-merging uop, even when nothing reads INC's flags. And it explains why the sustained throughput of INC/DEC is only 1, not 0.5, according to measurements other than yours, and according to Intel's published tables. It's nice that the integer register update itself doesn't have a false dep on flags, though, so they made it a lot less bad than P4. |
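The argument can be reduced to a toy throughput model. It assumes, as a simplification based on the numbers mentioned in this thread, that the decoders deliver at most two instructions per clock and that at most two uops per clock can be issued. The function name and the model itself are illustrative only and ignore any buffering between decode and issue.

```python
def sustained_per_clock(uops_per_instr, decode_width=2, issue_width=2):
    """Instructions per clock when limited by decode and by uop issue.

    decode_width and issue_width are assumed values for this sketch.
    """
    return min(decode_width, issue_width / uops_per_instr)

add_rate = sustained_per_clock(1)   # ADD r32, imm8: one uop
inc_rate = sustained_per_clock(2)   # INC r32: one uop plus a flag-merging uop

print(f"ADD: {add_rate} per clock, reciprocal throughput {1 / add_rate}c")
print(f"INC: {inc_rate} per clock, reciprocal throughput {1 / inc_rate}c")
```

Under these assumptions the extra flag-merging uop halves the sustained rate, which matches the 1c versus 0.5c figures quoted above.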
INC/DEC throughput |
---|
Author: Agner | Date: 2017-10-10 04:05 |
Peter Cordes wrote:
> Everything except your table says 1c throughput for INC/DEC, not 0.5c, for both KNL and Silvermont
You are right. It will be corrected in the next update. NEG and NOT have double throughput, but not INC and DEC. |