Agner's CPU blog


 
Test of Skylake-X and Goldmont - Agner - 2018-04-25
Test of Skylake-X and Goldmont - Alex Yee - 2018-04-25
Test of Skylake-X and Goldmont - Agner - 2018-04-26
Test of Skylake-X and Goldmont - Adrian Bocaniciu - 2018-04-26
Test of Skylake-X and Goldmont - Alex Yee - 2018-04-26
Test of Skylake-X and Goldmont - Slacker - 2018-05-03
Ryzen refresh and cache latencies - Slacker - 2018-05-03
Test of Skylake-X and Goldmont - Joe Duarte - 2018-07-19
Test of Skylake-X and Goldmont - Agner - 2018-07-20
Test of Skylake-X and Goldmont - Nancy - 2018-07-25
Test of Skylake-X and Goldmont - Agner - 2018-07-26
Test of Skylake-X and Goldmont - Goldmont Plus - 2018-09-09
Test of Skylake-X and Goldmont - Agner - 2018-09-09
Test of Goldmont Plus - Agner - 2018-09-15
Test of Goldmont Plus - Tremont - 2019-10-24
Tremont - Agner - 2019-10-24
Spreadsheet typo: VCOMPRESPS - Peter Cordes - 2019-05-05
 
Test of Skylake-X and Goldmont
Author: Agner Date: 2018-04-25 11:51
The Skylake-X processor is a Skylake processor with added support for the new instruction set AVX512, including AVX512BW, AVX512DQ, AVX512VL, and AVX512CD.

The performance is identical to earlier Skylakes except for the new AVX512 instructions, different cache sizes, and different number of CPU cores.

The AVX512 instruction set doubles the number of vector registers to 32 and doubles the size of these vector registers to 512 bits. The bigger vector registers let you have sixteen single precision floats in one vector register. AVX512 also allows conditional execution of selected elements in a vector by the use of seven new mask registers. This is useful if the code contains branches.
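
As a rough illustration of the masking described above, here is a pure-Python model of a merge-masked vector addition. It is not real AVX512 code: the function name `masked_add` and the list-based "registers" are assumptions made for this sketch. A 512-bit register holds 512 / 32 = 16 single precision lanes, and each mask bit decides whether a lane is updated or keeps its old value.

```python
LANES_F32_512 = 512 // 32  # sixteen single precision lanes in a 512-bit register


def masked_add(dest, a, b, mask):
    """Model of a merge-masked add: lane i becomes a[i] + b[i] if bit i of
    mask is set, otherwise it keeps the old value of dest[i]."""
    assert len(dest) == len(a) == len(b) == LANES_F32_512
    return [
        a[i] + b[i] if (mask >> i) & 1 else dest[i]
        for i in range(LANES_F32_512)
    ]


# Branch-free version of: if x[i] > 0: x[i] = x[i] + y[i]
x = [float(i - 8) for i in range(LANES_F32_512)]
y = [100.0] * LANES_F32_512
# Build the mask from a comparison, one bit per lane.
mask = sum(1 << i for i, v in enumerate(x) if v > 0)
result = masked_add(x, x, y, mask)
```

The branch in the scalar loop is replaced by a comparison that produces the mask, followed by one masked instruction, which is why masking helps with branchy code.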

The new Skylake variants have three vector execution units. Two of these units are 256 bits and the last is 512 bits. The two 256-bit units are combined when executing a 512-bit vector. This means that you can do two 512-bit vector calculations per clock cycle.

Floating point addition, multiplication, and fused multiply-and-add (FMA) instructions have a throughput of two 512-bit vectors per clock on some versions and one on other versions. Early Skylake-X processors with fewer than ten cores can do one 512-bit floating point instruction per clock, while Skylake-X with 10+ cores and newer Skylakes can do two. They can all do two 256-bit floating point vector instructions per clock. These processors are otherwise identical and they all have the same CPUID family number, so they are difficult to distinguish by software. I guess Intel has given high priority to FMA instructions on the biggest processors so that they can boast of a high FLOPS measure.
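
To put the FLOPS remark in numbers, the following sketch computes the peak number of single precision floating point operations per clock for one core, counting each FMA as two operations per lane. The function name and parameters are illustrative only, and the clock frequency under AVX512 load is not modelled.

```python
def peak_flops_per_clock(fma_units, vector_bits, element_bits=32):
    """Peak floating point operations per clock cycle for one core.
    Each FMA counts as two operations (multiply and add) per lane."""
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2


one_unit_512 = peak_flops_per_clock(1, 512)   # one 512-bit FMA per clock
two_units_512 = peak_flops_per_clock(2, 512)  # two 512-bit FMAs per clock
two_units_256 = peak_flops_per_clock(2, 256)  # two 256-bit FMAs per clock
```

With a single 512-bit FMA unit the peak is the same as with two 256-bit units, so only the two-unit configuration doubles the peak figure.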

Historically, the first microprocessor version to support a new bigger vector size has usually had poor performance because it simply used half-size execution units twice when processing the big vector. With this history in mind, the Skylake-X is actually better than expected. Many 512-bit vector instructions have the same latency and throughput as the corresponding 256-bit instructions. Simple integer vector instructions, such as addition, have a throughput of three instructions per clock for 256 bits and two instructions per clock for 512 bits. But I think it is rare that you will have three vector instructions issued simultaneously anyway because there are likely to be narrower bottlenecks elsewhere, such as cache access or instruction decoding.

So the conclusion is that you can speed up CPU-intensive code by almost a factor of two by using the new 512-bit vector registers if memory access and caching are not a bottleneck. There is not much software that supports AVX512 yet so you will have to compile the code yourself to get this performance boost. But don't be surprised if the result is disappointing. Instruction decoding and cache access are still very likely to be the limiting factors.
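
One way to see why the result may be disappointing is Amdahl's law: only the part of the run time that is actually limited by vector execution gets faster. The sketch below is a generic model with made-up fractions, not a measurement, and the function name is an assumption for the example.

```python
def overall_speedup(vector_fraction, vector_speedup=2.0):
    """Amdahl's law: vector_fraction of the original run time is sped up
    by vector_speedup, the rest (decoding, cache misses, ...) is unchanged."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)


fully_vector_bound = overall_speedup(1.0)  # the ideal case
half_vector_bound = overall_speedup(0.5)   # half the time is spent elsewhere
```

If only half of the run time is bound by vector execution, doubling the vector width gives a factor of about 1.33 rather than 2.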

The Intel Goldmont is a successor of Atom and Silvermont. These are small low-power processors for less demanding applications. The Goldmont has full out-of-order execution and a maximum throughput of three instructions per clock cycle. This is certainly enough for a lot of applications, such as small portable computers and also for low traffic servers where power consumption is an issue. The Goldmont does not support the higher instruction set extensions. The vector registers are only 128 bits and the highest instruction set it supports is SSE4.2. There is no AVX, AVX2, or AVX512. It does have the encryption instructions AES and SHA, though.
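
Since Goldmont stops at SSE4.2 while Skylake-X goes up to AVX512, software that runs on both needs to pick a code path at run time. The sketch below chooses the widest available extension from a string of feature flags spelled the way Linux prints them in /proc/cpuinfo. The function name and the two shortened flag strings are hypothetical and not taken from real machines.

```python
def best_vector_extension(cpu_flags):
    """Pick the widest vector extension from a string of CPU feature flags.
    Flag names follow the Linux /proc/cpuinfo spelling."""
    preference = [
        ("avx512f", 512),
        ("avx2", 256),
        ("avx", 256),
        ("sse4_2", 128),
        ("sse2", 128),
    ]
    flags = set(cpu_flags.split())
    for name, bits in preference:
        if name in flags:
            return name, bits
    return None, 0


# Hypothetical flag strings, shortened for the example.
goldmont_like = "fpu sse sse2 ssse3 sse4_1 sse4_2 aes sha_ni"
skylake_x_like = "fpu sse sse2 sse4_2 avx avx2 avx512f avx512bw avx512dq avx512vl avx512cd"

goldmont_choice = best_vector_extension(goldmont_like)
skylake_x_choice = best_vector_extension(skylake_x_like)
```

On a Linux system the real flag string could be read from the "flags" line of /proc/cpuinfo. Compiled code would normally query CPUID instead.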

All the details are in my microarchitecture manual and my instruction tables: agner.org/optimize/#manuals

[Edited: This posting originally said something wrong about Cannon Lake. Please disregard all comments about Cannon Lake in this thread]

   
Test of Skylake-X and Goldmont
Author: Alex Yee Date: 2018-04-25 16:25
Nice work Agner. I've been following your blog for years and it's almost always the go-to site for low-level hardware details. And I've lost track of how many people I've referred to your manuals over the years.

I wanted to point out that all the Skylake X chips (even the < 10 core-count) models have both 512-bit FMAs. At launch, Intel stated that they only had 1 FMA (no port5 FMA), but tests showed otherwise. Intel officially changed their docs in February to confirm that they all indeed have 2 FMAs. But they haven't said anything about how they got the initial docs wrong. (https://www.extremetech.com/computing/263963-intel-reverses-declares-skylake-x-cpus-two-avx-512-units)


Other questions:

1. Have you observed a longer latency for 512-bit instructions that go to the port5 FMA? Some people are measuring this to be 6 cycles (as opposed to 4). I believe Intel's documents also state something like this.

2. Have you tested the size of the register file in 512-bit mode? Intel says there are 168 vector registers. But they don't say how large they are. In some cases I've observed a significant performance difference between 256-bit and 512-bit code that is otherwise identical. The difference seems too large to be attributed to a 6-cycle port5 latency and memory access is all in L1. My hypothesis is that there may only be 84 renamed 512-bit registers if Intel decided to combine pairs of 256-bit registers to avoid doubling up the size/area of the register file. But I don't consider myself competent enough to try testing it myself.

3. For Cannonlake: I don't know where you got access to one to test, but it looks like some people are interested in the latency/throughput of the AVX512-VBMI byte-granular permute: https://twitter.com/InstLatX64/status/986935999640080384 - And do the AVX512-IFMA instructions have the same latency/throughput as the floating-point FMA instructions?

   
Test of Skylake-X and Goldmont
Author: Agner Date: 2018-04-26 05:14
Alex Yee wrote:
I wanted to point out that all the Skylake X chips (even the < 10 core-count) models have both 512-bit FMAs. At launch, Intel stated that they only had 1 FMA (no port5 FMA), but tests showed otherwise. Intel officially changed their docs in February to confirm that they all indeed have 2 FMAs. But they haven't said anything about how they got the initial docs wrong. (https://www.extremetech.com/computing/263963-intel-reverses-declares-skylake-x-cpus-two-avx-512-units)
I have tested a Skylake-X with 4 cores. It has a throughput of one 512-bit FMA per clock, at port 0. 256-bit FMA is two per clock at port 0+1.

A newer Skylake-X with 8 cores gave a throughput of two 512-bit FMA per clock, at port 0+5. 256-bit FMA is two per clock at port 0+1.

Have you observed a longer latency for 512-bit instructions that go to the port5 FMA?
I measured a latency of 4. But there is an extra latency when the previous or next instructions use a different unit.

Have you tested the size of the register file in 512-bit mode? Intel says there are 168 vector registers. But they don't say how large they are.
No, it is very difficult to measure. I don't know if they are 256 or 512 bits.
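
The latency and throughput figures above determine how many independent FMA dependency chains (for example separate accumulators) a loop needs in order to keep the units busy. The sketch below applies the usual latency times throughput rule. The function names are made up for illustration, and the extra latency when switching between units is ignored.

```python
import math


def chains_needed(latency_cycles, throughput_per_clock):
    """Minimum number of independent dependency chains needed so that
    a new instruction can start whenever a unit is free."""
    return math.ceil(latency_cycles * throughput_per_clock)


def cycles_for(n_instructions, chains, latency_cycles, throughput_per_clock):
    """Rough cycle estimate: limited either by the dependency chains or by
    the execution units, whichever is slower."""
    latency_bound = n_instructions / chains * latency_cycles
    throughput_bound = n_instructions / throughput_per_clock
    return max(latency_bound, throughput_bound)


two_fma_units = chains_needed(4, 2)  # latency 4, two FMAs per clock
one_fma_unit = chains_needed(4, 1)   # latency 4, one FMA per clock
```

With a latency of 4 and two FMAs per clock, eight independent chains are needed. A single accumulator would run at one FMA per four clocks regardless of the number of units.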

   
Test of Skylake-X and Goldmont
Author: Adrian Bocaniciu Date: 2018-04-26 09:19
Alex Yee wrote:
I wanted to point out that all the Skylake X chips (even the < 10 core-count) models have both 512-bit FMAs. At launch, Intel stated that they only had 1 FMA (no port5 FMA), but tests showed otherwise. Intel officially changed their docs in February to confirm that they all indeed have 2 FMAs. But they haven't said anything about how they got the initial docs wrong.
Actually there is no evidence that the "Skylake Server" chips with < 10 cores have dual 512-bit FMAs.
There are 3 kinds of "Skylake Server" dies: LCC (10-core), HCC (18-core) and XCC (28-core).

The original statement about a single 512-bit FMA referred to the LCC die, not to processors that are sold with <= 10 enabled cores.

In the document
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
the "Table 4" gives the die used for the chips sold as Xeon 3xxx, 4xxx, 5xxx, 6xxx or 8xxx.

For some weird reason, it appears that all the Xeon Gold and Platinum processors are derived from the 28-core die, even the models with only 4 or 6 enabled cores.

The Xeon Silver or Bronze are derived from the 18-core die or from the 10-core die, but those have only a single enabled 512-bit FMA unit, so there is no evidence that the 10-core and 18-core dies have a functional second 512-bit FMA unit.

I am not aware of any public information about the die used for Xeon W and Skylake X processors.
Nevertheless, Xeon W and Skylake X have characteristics very similar to Xeon Gold, so I suppose that all their models are also derived from the 28-core die, regardless of the number of enabled cores, even the 4-core models.
This supposition could be tested by reading the CAPID4 bits (see the document linked above; these bits give the die size) from a Skylake X processor, but I do not have one.

The initial announcement from Intel about the number of FMA units in Skylake X might be explained if they initially intended to derive the models with fewer cores from LCC or HCC, but finally decided to derive all of them from XCC.

Whether it is true that the cores of the LCC die do not have the second FMA, I do not know. What seems to be certain is that Intel does not sell any processor derived from LCC with an enabled second FMA. Maybe they sell Xeon Bronze in such large quantities that it was worthwhile to have a different core layout for a smaller die area. Even stranger is the fact that it appears that no HCC die is sold with an enabled second FMA either, so maybe it also does not include the second FMA unit in the layout, to allow lower production costs for Xeon Silver.

If neither LCC nor HCC have the 2nd FMA, that would explain why 4-core Xeon Gold and probably also 4-core Xeon W and 4-core Skylake X, are derived from the 28-core die, even if that seems to be very wasteful, unless such dies with 23 or 24 bad cores are really frequent, so they could not be sold as processors with more cores.

   
Test of Skylake-X and Goldmont
Author: Alex Yee Date: 2018-04-26 11:22
Adrian Bocaniciu wrote:
Alex Yee wrote:
I wanted to point out that all the Skylake X chips (even the < 10 core-count) models have both 512-bit FMAs. At launch, Intel stated that they only had 1 FMA (no port5 FMA), but tests showed otherwise. Intel officially changed their docs in February to confirm that they all indeed have 2 FMAs. But they haven't said anything about how they got the initial docs wrong.
Actually there is no evidence that the "Skylake Server" chips with < 10 cores have dual 512-bit FMAs.
There are 3 kinds of "Skylake Server" dies: LCC (10-core), HCC (18-core) and XCC (28-core).

The original statement about a single 512-bit FMA referred to the LCC die, not to processors that are sold with <= 10 enabled cores.

In the document
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
the "Table 4" gives the die used for the chips sold as Xeon 3xxx, 4xxx, 5xxx, 6xxx or 8xxx.

For some weird reason, it appears that all the Xeon Gold and Platinum processors are derived from the 28-core die, even the models with only 4 or 6 enabled cores.

The Xeon Silver or Bronze are derived from the 18-core die or from the 10-core die, but those have only a single enabled 512-bit FMA unit, so there is no evidence that the 10-core and 18-core dies have a functional second 512-bit FMA unit.

I am not aware of any public information about the die used for Xeon W and Skylake X processors.
Nevertheless, Xeon W and Skylake X have characteristics very similar to Xeon Gold, so I suppose that all their models are also derived from the 28-core die, regardless of the number of enabled cores, even the 4-core models.
This supposition could be tested by reading the CAPID4 bits (see the document linked above; these bits give the die size) from a Skylake X processor, but I do not have one.

The initial announcement from Intel about the number of FMA units in Skylake X might be explained if they initially intended to derive the models with fewer cores from LCC or HCC, but finally decided to derive all of them from XCC.

Whether it is true that the cores of the LCC die do not have the second FMA, I do not know. What seems to be certain is that Intel does not sell any processor derived from LCC with an enabled second FMA. Maybe they sell Xeon Bronze in such large quantities that it was worthwhile to have a different core layout for a smaller die area. Even stranger is the fact that it appears that no HCC die is sold with an enabled second FMA either, so maybe it also does not include the second FMA unit in the layout, to allow lower production costs for Xeon Silver.

If neither LCC nor HCC have the 2nd FMA, that would explain why 4-core Xeon Gold and probably also 4-core Xeon W and 4-core Skylake X, are derived from the 28-core die, even if that seems to be very wasteful, unless such dies with 23 or 24 bad cores are really frequent, so they could not be sold as processors with more cores.

Sorry, I'll clarify that when I said "Skylake X", I mean specifically the desktop Core i7 78--X and Core i9 79--X(E) models. All of them have the port5 FMA. The only SKUs with the disabled port5 FMA are a subset of the Xeons.

The <= 10 core models (7800X, 7820X, and 7900X) all use the LCC die. All the 12-18 core models (7920X - 7980XE) use the HCC die. This has been confirmed by numerous people in the overclocking community. Since the Skylake X SKUs are all overclockable, many people in the overclocking community delid them (remove the IHS) to improve the thermal conductivity when overclocking. In doing so, they can see the physical size of the die. So they know which die it is.

Here's a comparison of a delidded 7900X (LCC) and a 7980XE (HCC): https://www.pcper.com/news/Processors/Intel-Skylake-X-18-core-Die-Pictured-Its-Massive

Since the 7800X, 7820X, and 7900X all have both FMAs and are all using the LCC die, this means the LCC die must (physically) have the port5 FMA. Likewise for the 7920X, 7940X, 7960X, and 7980XE with the HCC die.

In other words, all the dies (LCC, HCC, and XCC) physically have the port5 FMA. But it's disabled in a subset of the Xeon SKUs either for market segmentation and/or manufacturing yields.

Things are less clear for the Xeons since they aren't in the enthusiast market (so nobody delids them). But it's not unreasonable to suspect that the single FMA SKUs are at least partially composed of reject dies with defective port5 FMA units.

   
Test of Skylake-X and Goldmont
Author: Slacker Date: 2018-05-03 06:47
Adrian Bocaniciu wrote:
For some weird reason, it appears that all the Xeon Gold and Platinum processors are derived from the 28-core die, even the models with only 4 or 6 enabled cores.
The most likely reason is that the HCC die contains only two UPI links (inter-socket interfaces), while the XCC die has three. This is important for performance in quad-socket and larger configurations.
   
Ryzen refresh and cache latencies
Author: Slacker Date: 2018-05-03 07:01
Since this appears to be the current designated manuals update discussion thread, I'll post some interesting info I found about the recent Ryzen refresh:

AnandTech wrote:


When AMD first launched the Ryzen 7 1800X, the L2 latency was tested and listed at 17 clocks. This was a little high – it turns out that the engineers had intended for the L2 latency to be 12 clocks initially, but ran out of time to tune the firmware and layout before sending the design off to be manufactured, leaving 17 cycles as the best compromise based on what the design was capable of and did not cause issues. With Threadripper and the Ryzen APUs, AMD tweaked the design enough to hit an L2 latency of 12 cycles, which was not specifically promoted at the time despite the benefits it provides. Now with the Ryzen 2000-series, AMD has reduced it down further to 11 cycles. We were told that this was due to both the new manufacturing process but also additional tweaks made to ensure signal coherency. In our testing, we actually saw an average L2 latency of 10.4 cycles, down from 16.9 cycles on the Ryzen 7 1800X.

The L3 difference is a little unexpected: AMD stated a 16% better latency: 11.0 ns to 9.2 ns. We saw a change from 10.7 ns to 8.1 ns, which was a drop from 39 cycles to 30 cycles.

This means that newer Ryzens actually have lower cache latencies than Skylake, despite having larger caches.
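
The L3 figures quoted above are given in both nanoseconds and clock cycles, so the clock frequency used in the test can be backed out, and the relative improvement can be compared with AMD's stated number. This is just arithmetic on the quoted values. The function names are made up for the example.

```python
def implied_clock_ghz(latency_cycles, latency_ns):
    """Clock frequency implied by a latency given in both cycles and ns."""
    return latency_cycles / latency_ns


def relative_improvement(old, new):
    """Fractional reduction from old to new."""
    return (old - new) / old


old_clock = implied_clock_ghz(39, 10.7)       # about 3.64 GHz
new_clock = implied_clock_ghz(30, 8.1)        # about 3.70 GHz
amd_stated = relative_improvement(11.0, 9.2)  # about 0.16
measured = relative_improvement(10.7, 8.1)    # about 0.24
```

The two implied clock frequencies are nearly the same, so the measured drop of about 24% in nanoseconds comes almost entirely from the lower cycle count rather than from a higher clock.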

   
Test of Skylake-X and Goldmont
Author: Joe Duarte Date: 2018-07-19 00:51
Hi Agner – Could you possibly test the performance of the Intel TSX instructions on relevant workloads? There seems to be little information on the web. See: https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions
   
Test of Skylake-X and Goldmont
Author: Agner Date: 2018-07-20 01:14
Joe Duarte wrote:
Hi Agner – Could you possibly test the performance of the Intel TSX instructions on relevant workloads?
That's a lot to ask. It has to be tested in all kinds of different scenarios and compared with other kinds of speculative multithreading. I'm afraid I can't find the time for such a project. I hope somebody else will.
   
Test of Skylake-X and Goldmont
Author: Nancy Date: 2018-07-25 22:52
Lack of contents in "unit" column of Goldmont (instruction_tables.ods), why?
   
Test of Skylake-X and Goldmont
Author: Agner Date: 2018-07-26 01:08
Nancy wrote:
Lack of contents in "unit" column of Goldmont (instruction_tables.ods), why?
Because I haven't found the information.
   
Test of Skylake-X and Goldmont
Author: Goldmont Plus Date: 2018-09-09 04:43
Agner, will you test Goldmont Plus?

https://ark.intel.com/products/codename/83915/Gemini-Lake

https://www.anandtech.com/show/12146/intel-launches-gemini-lake-pentium-silver-and-celeron-socs-new-cpu-media-features

https://en.wikipedia.org/wiki/Goldmont_Plus

https://en.wikichip.org/wiki/intel/microarchitectures/goldmont_plus

   
Test of Skylake-X and Goldmont
Author: Agner Date: 2018-09-09 07:26
Goldmont Plus wrote:
Agner, will you test Goldmont Plus?
Most of the improvements over Goldmont apply to graphics, which I am not testing, but there seem to be some improvements in execution units and cache system that I can test if somebody gives me access to a Goldmont Plus.
   
Test of Goldmont Plus
Author: Agner Date: 2018-09-15 08:05
I have tested the Goldmont Plus now. Thanks to Kevyn for giving me access to this processor.

I have not tested the graphics, but for the execution unit and cache, there are only small changes from Goldmont. Floating point division and square root are improved a lot. AES encryption instructions are also improved. The level-2 cache is bigger, but cache latencies are approximately the same. The performance of taken jumps is strange on the Goldmont Plus. The throughput for jumps is better than on Goldmont if there is no more than one jump in each 16-byte block of code, but worse if there is more than one jump in a 16-byte block of code.
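
To check whether a piece of code runs into the jump behaviour described above, one could list the addresses of its jump instructions (for example from a disassembly) and count how many fall into each aligned 16-byte block. The helper names and the example addresses below are hypothetical. Counting a jump by the address where it starts is an assumption of this sketch.

```python
from collections import Counter


def jumps_per_block(jump_addresses, block_size=16):
    """Count how many jump instructions start in each aligned block."""
    return Counter(addr // block_size * block_size for addr in jump_addresses)


def crowded_blocks(jump_addresses, block_size=16):
    """Return start addresses of blocks holding more than one jump."""
    counts = jumps_per_block(jump_addresses, block_size)
    return sorted(block for block, n in counts.items() if n > 1)


# Hypothetical addresses of jump instructions in a piece of code.
addresses = [0x1002, 0x100C, 0x1015, 0x1033]
crowded = crowded_blocks(addresses)
```

Here the block starting at 0x1000 contains two jumps, so aligning or padding the second jump into the next block might be worth testing.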

   
Test of Goldmont Plus
Author: Tremont Date: 2019-10-24 18:48
https://newsroom.intel.com/news/intel-introduces-tremont-microarchitecture/

Agner, can you please test Intel Tremont if you can get access to it?

   
Tremont
Author: Agner Date: 2019-10-24 22:45
Tremont wrote:
Agner, can you please test Intel Tremont if you can get access to it?
Yes, if somebody has a Tremont that they will give me (remote) access to.
   
Spreadsheet typo: VCOMPRESPS
Author: Peter Cordes Date: 2019-05-05 17:33
The SKX entry for VCOMPRESPS/PD only has one S in COMPRESS.

Also, we're missing test data for compress/expand with memory destination / memory source. www.uops.info/table.html shows it's 4 fused-domain uops for vcompressps on SKX / CNL; the store doesn't micro-fuse when it's part of a larger instruction, and both p5 uops are still needed. Or VEXPANDPS is a 3-uop instruction, load + 2p5.

www.uops.info/html-lat/SKX/VCOMPRESSPS_Z_ZMM_K_ZMM-Measurements.html shows the latency from ZMM -> ZMM is 3 cycles, but the latency from K1 input to ZMM output is 6 cycles. So presumably the internal implementation is 1 lane-crossing port-5 uop to generate a shuffle vector, and another lane-crossing shuffle to apply it.
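
For readers not familiar with the instruction, here is a pure-Python model of what a vector compress does: the elements selected by the mask are packed contiguously into the low lanes. This only models the architectural result, not the two-uop implementation guessed at above. The function name and the list-based vectors are assumptions of the sketch.

```python
def compress(src, mask, dest=None):
    """Model of a vector compress: the elements of src whose mask bit is set
    are packed contiguously into the low lanes of the result. The remaining
    lanes are zero (zero-masking) or taken from dest (merge-masking)."""
    n = len(src)
    selected = [src[i] for i in range(n) if (mask >> i) & 1]
    if dest is None:
        rest = [0.0] * (n - len(selected))
    else:
        rest = list(dest[len(selected):])
    return selected + rest


src = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
zeroed = compress(src, 0b10100101)                   # lanes 0, 2, 5, 7 selected
merged = compress(src, 0b10100101, dest=[-1.0] * 8)  # upper lanes kept from dest
```

The position of each output element depends on how many mask bits are set below it, which is consistent with the idea that a shuffle control vector has to be generated from the mask first.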

---

Side note: are we ever going to get a full set of integer AVX2 instruction tests for Ryzen? pblendvb tests (even non-VEX) are actually missing from all AMD CPUs, not just Ryzen. But Ryzen and Excavator are missing `vperm2i128` and `vpermd`, among other things.