Agner's CPU blog

Test of Skylake-X and Goldmont - Agner - 2018-04-25

Test of Skylake-X and Goldmont - Alex Yee - 2018-04-25

Test of Skylake-X and Goldmont - Agner - 2018-04-26

Test of Skylake-X and Goldmont - Adrian Bocaniciu - 2018-04-26

Test of Skylake-X and Goldmont - Alex Yee - 2018-04-26

Test of Skylake-X and Goldmont - Slacker - 2018-05-03

Ryzen refresh and cache latencies - Slacker - 2018-05-03

Test of Skylake-X and Goldmont - Joe Duarte - 2018-07-19

Test of Skylake-X and Goldmont - Agner - 2018-07-20

Test of Skylake-X and Goldmont - Nancy - 2018-07-25

Test of Skylake-X and Goldmont - Agner - 2018-07-26

Test of Skylake-X and Goldmont - Goldmont Plus - 2018-09-09

Test of Skylake-X and Goldmont - Agner - 2018-09-09

Test of Goldmont Plus - Agner - 2018-09-15

Test of Goldmont Plus - Tremont - 2019-10-24

Tremont - Agner - 2019-10-24

Spreadsheet typo: VCOMPRESPS - Peter Cordes - 2019-05-05

Test of Skylake-X and Goldmont

Author: Agner

Date: 2018-04-25 11:51

The Skylake-X processor is a Skylake processor with added support for the new instruction set AVX512, including AVX512BW, AVX512DQ, AVX512VL, and AVX512CD.

The performance is identical to earlier Skylakes except for the new AVX512 instructions, different cache sizes, and different number of CPU cores.

The AVX512 instruction set doubles the number of vector registers to 32 and doubles the size of these vector registers to 512 bits. The bigger vector registers let you have sixteen single-precision floats in one vector register. AVX512 also allows conditional execution of selected elements in a vector by the use of eight new mask registers (k0-k7), of which seven (k1-k7) can serve as write masks. This is useful if the code contains branches.
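
To make the masking idea concrete, here is a minimal sketch in plain Python that emulates it with one list element per vector lane. It is not real AVX512 code. The helper names `lanes` and `masked_add` are made up for this illustration, and the behaviour modelled is merge masking, where unselected lanes keep their old value (as far as I know this is what the C intrinsic `_mm512_mask_add_ps` does).

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one vector register."""
    return register_bits // element_bits


def masked_add(dst, a, b, mask):
    """Emulate a merge-masked vector add, one list element per lane.

    Bit i of `mask` selects lane i: selected lanes receive a[i] + b[i],
    unselected lanes keep their old value from `dst`.
    """
    return [
        a[i] + b[i] if (mask >> i) & 1 else dst[i]
        for i in range(len(dst))
    ]


n = lanes(512, 32)  # single-precision floats per 512-bit register
a = [float(i) for i in range(n)]
b = [100.0] * n

# Scalar code with a branch:  if a[i] >= 8: a[i] += b[i]
# Vector form: build the mask from the comparison, then do one masked add.
mask = sum(1 << i for i in range(n) if a[i] >= 8.0)
result = masked_add(a, a, b, mask)
```

The branch inside the loop body becomes a compare that produces a bit mask, followed by a single masked operation over all sixteen lanes, so no per-element jump is needed.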

The new Skylake variants have three vector execution units. Two of these units are 256 bits and the last is 512 bits. The two 256-bit units are combined when executing a 512-bit vector. This means that you can do two 512-bit vector calculations per clock cycle.

Floating point addition, multiplication, and fused multiply-and-add (FMA) instructions have a throughput of two 512-bit vectors per clock on some versions and one on other versions. Early Skylake-X processors with fewer than ten cores can do one 512-bit floating point instruction per clock, while Skylake-X with 10+ cores and newer Skylakes can do two. They can all do two 256-bit floating point vector instructions per clock. These processors are otherwise identical and they all have the same CPUID family number, so they are difficult to distinguish by software. I guess Intel has given high priority to FMA instructions on the biggest processors so that they can boast of a high FLOPS measure.
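
The throughput figures above translate into peak FLOPS per clock by simple arithmetic. The following is a rough sketch, not a measurement: the function name `peak_flops_per_cycle` is made up for this illustration, it counts an FMA as two floating point operations per lane, and it assumes single precision (32-bit elements) unless told otherwise.

```python
def peak_flops_per_cycle(fma_units: int, vector_bits: int,
                         element_bits: int = 32) -> int:
    """Theoretical peak FLOPs per clock for FMA-only code."""
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2  # one multiply + one add per lane


two_512 = peak_flops_per_cycle(2, 512)  # two 512-bit FMAs per clock
one_512 = peak_flops_per_cycle(1, 512)  # one 512-bit FMA per clock
two_256 = peak_flops_per_cycle(2, 256)  # two 256-bit FMAs per clock
```

Under these assumptions, one 512-bit FMA per clock gives the same peak number as two 256-bit FMAs per clock, so only the versions with two 512-bit FMAs per clock double the theoretical FMA rate.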

Historically, the first microprocessor version to support a new bigger vector size has usually had poor performance because it simply used half-size execution units twice when processing the big vector. With this history in mind, the Skylake-X is actually better than expected. Many 512-bit vector instructions have the same latency and throughput as the corresponding 256-bit instructions. Simple integer vector instructions, such as addition, have a throughput of three instructions per clock for 256 bits and two instructions per clock for 512 bits. But I think it is rare that you will have three vector instructions issued simultaneously anyway because there are likely to be narrower bottlenecks elsewhere, such as cache access or instruction decoding.

So the conclusion is that you can speed up CPU-intensive code by almost a factor of two by using the new 512-bit vector registers if memory access and caching are not bottlenecks. There is not much software that supports AVX512 yet, so you will have to compile the code yourself to get this performance boost. But don't be surprised if the result is disappointing. Instruction decoding and cache access are still very likely to be the limiting factors.
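
One way to see why the result can be disappointing is a simple Amdahl's-law estimate. This is a generic model brought in for illustration, not something measured in these tests. It assumes that only a fraction of the run time is spent in code that benefits from the wider vectors and that the rest (decoding, cache misses, scalar code) is unchanged. The name `overall_speedup` is made up for this sketch.

```python
def overall_speedup(vector_fraction: float,
                    vector_speedup: float = 2.0) -> float:
    """Amdahl's law: whole-program speedup when only part of the run time
    (vector_fraction) gets faster by a factor of vector_speedup."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)


all_vector = overall_speedup(1.0)   # everything benefits from 512-bit vectors
half_vector = overall_speedup(0.5)  # half the time is limited by other things
```

In this model, a program that spends half its time waiting on bottlenecks that wider vectors do not help gains only about a third, even if the vector code itself runs twice as fast.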

The Intel Goldmont is a successor of Atom and Silvermont. These are small low-power processors for less demanding applications. The Goldmont has full out-of-order execution and a maximum throughput of three instructions per clock cycle. This is certainly enough for a lot of applications, such as small portable computers and also for low traffic servers where power consumption is an issue. The Goldmont does not support the higher instruction set extensions. The vector registers are only 128 bits and the highest instruction set it supports is SSE4.2. There is no AVX, AVX2, or AVX512. It does have the encryption instructions AES and SHA, though.
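
Software that must run on both kinds of processors typically picks a code path at run time. Below is a minimal sketch of that selection logic, assuming the feature flag names that Linux prints in /proc/cpuinfo. The names `VECTOR_ISAS`, `widest_vector_isa`, and `read_cpu_flags` are made up for this illustration, and the two example flag sets are hand-written, not dumped from real machines.

```python
# Widest first. Flag names follow the Linux /proc/cpuinfo spelling.
VECTOR_ISAS = [
    ("avx512f", 512),
    ("avx2", 256),
    ("avx", 256),
    ("sse4_2", 128),
    ("sse2", 128),
]


def widest_vector_isa(flags):
    """Return (flag, vector_bits) for the widest ISA present, or None."""
    for flag, bits in VECTOR_ISAS:
        if flag in flags:
            return flag, bits
    return None


def read_cpu_flags(path="/proc/cpuinfo"):
    """Linux-only helper: collect the flags listed for the first CPU."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


# Illustrative, hand-written flag sets.
goldmont_like = {"sse2", "sse4_2", "aes", "sha_ni"}
skylake_x_like = {"sse2", "sse4_2", "avx", "avx2", "avx512f",
                  "avx512bw", "avx512dq", "avx512vl", "avx512cd"}

goldmont_choice = widest_vector_isa(goldmont_like)
skylake_x_choice = widest_vector_isa(skylake_x_like)
```

On a Linux machine, `widest_vector_isa(read_cpu_flags())` would report the widest of these extensions that the processor advertises. Note that flags of this kind only say which instructions exist, not how many execution units are behind them.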

All the details are in my microarchitecture manual and my instruction tables: agner.org/optimize/#manuals

[Edited: This posting originally said something wrong about Cannon Lake. Please disregard all comments about Cannon Lake in this thread]

Author: Slacker	Date: 2018-05-03 06:47
Adrian Bocaniciu wrote: For some weird reason, it appears that all the Xeon Gold and Platinum processors are derived from the 28-core die, even the models with only 4 or 6 enabled cores.

The most likely reason is that the HCC die contains only two UPI links (inter-socket interfaces), while the XCC die has three. This is important for performance in quad-socket and larger configurations.

Author: Joe Duarte	Date: 2018-07-19 00:51
Hi Agner – Could you possibly test the performance of the Intel TSX instructions on relevant workloads? There seems to be little information on the web. See: https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions

Author: Agner	Date: 2018-07-20 01:14
Joe Duarte wrote: Hi Agner – Could you possibly test the performance of the Intel TSX instructions on relevant workloads?

That's a lot to ask. It has to be tested in all kinds of different scenarios and compared with other kinds of speculative multithreading. I'm afraid I can't find the time for such a project. I hope somebody else will.

Author: Nancy	Date: 2018-07-25 22:52
Lack of contents in "unit" column of Goldmont (instruction_tables.ods), why?

Author: Agner	Date: 2018-07-26 01:08
Nancy wrote: Lack of contents in "unit" column of Goldmont (instruction_tables.ods), why?

Because I haven't found the information.

Author: Agner	Date: 2018-09-09 07:26
Goldmont Plus wrote: Agner, will you test Goldmont Plus?

Most of the improvements over Goldmont apply to graphics, which I am not testing, but there seem to be some improvements in execution units and cache system that I can test if somebody gives me access to a Goldmont Plus.

Author: Tremont	Date: 2019-10-24 18:48
https://newsroom.intel.com/news/intel-introduces-tremont-microarchitecture/ Agner, can you please test Intel Tremont if you can get access to it?

Author: Agner	Date: 2019-10-24 22:45
Tremont wrote: Agner, can you please test Intel Tremont if you can get access to it?

Yes, if somebody has a Tremont that they will give me (remote) access to.